Loop Nests: Latest Research Papers

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

ACM Transactions on Architecture and Code Optimization, Vol 19 (1), pp. 1-26, 2022. DOI: 10.1145/3485137

Author(s): Prasanth Chatarasi, Hyoukjun Kwon, Angshuman Parashar, Michael Pellauer, Tushar Krishna, ...

Keyword(s): Deep Learning, Cost Model, Cost Models, Mapping Space, Loop Nest, Loop Nests, Higher Dimensional, On Chip, The Cost, Dimensional Mapping

A spatial accelerator's efficiency depends heavily on both its mapper and its cost models to generate optimized mappings for the various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of mappings, which makes the cost models hard to adapt to new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge, because any mapping expressed in the notation can be analyzed precisely with the MDC cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules. The outcome of these rules is that any perfectly nested loop nest with affine tensor subscripts and no conditionals is conformable to the MDC notation. Most primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in loop-nest form to MDC notation and use the MDC cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional subspaces, off-chip and on-chip, and optimizes the off-chip subspace first, followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel; a benefit of the approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel on major DNN operators and compared it with past optimizers.
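The decoupled off-chip/on-chip idea can be illustrated with a toy search over a GEMM operator. This is a minimal sketch under stated assumptions, not Marvel or the MDC cost model: the problem size, `BUFFER_WORDS`, `NUM_PES`, and the two cost functions `offchip_cost` and `onchip_cost` are all hypothetical. Step 1 picks tile sizes that minimize a DRAM-traffic estimate; step 2 then picks a PE-grid split with that tile held fixed.

```python
import math
from itertools import product

# Hypothetical GEMM size and hardware parameters (illustrative, not from the paper).
M = N = K = 64          # C[m][n] += A[m][k] * B[k][n]
BUFFER_WORDS = 4096     # assumed on-chip buffer capacity, in words
NUM_PES = 16            # assumed number of processing elements


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def offchip_cost(tm, tn, tk):
    """Toy DRAM-traffic estimate (in words) for one choice of off-chip tile sizes."""
    if tm * tk + tk * tn + tm * tn > BUFFER_WORDS:
        return math.inf  # the three tiles do not fit in the buffer together
    # Each C tile stays resident while k is swept, so A is read N/tn times,
    # B is read M/tm times, and C is written once. tk only affects the footprint.
    return M * K * (N // tn) + K * N * (M // tm) + M * N


def onchip_cost(tile, pm, pn):
    """Toy cycle estimate for spreading one off-chip tile over a pm x pn PE grid."""
    tm, tn, tk = tile
    if pm * pn > NUM_PES:
        return math.inf  # not enough PEs
    return (tm // pm) * (tn // pn) * tk


def decoupled_search():
    evaluated = 0
    # Step 1: search the off-chip subspace (tile sizes) on its own.
    best_tile, best_off = None, math.inf
    for tile in product(divisors(M), divisors(N), divisors(K)):
        evaluated += 1
        cost = offchip_cost(*tile)
        if cost < best_off:
            best_tile, best_off = tile, cost
    # Step 2: search the on-chip subspace (spatial split over PEs)
    # with the off-chip choice held fixed.
    tm, tn, _ = best_tile
    best_pe, best_on = None, math.inf
    for pm, pn in product(divisors(tm), divisors(tn)):
        evaluated += 1
        cost = onchip_cost(best_tile, pm, pn)
        if cost < best_on:
            best_pe, best_on = (pm, pn), cost
    return best_tile, best_off, best_pe, best_on, evaluated


if __name__ == "__main__":
    print(decoupled_search())
```

With these assumed parameters the two-step search evaluates 385 candidates (343 tile choices plus 42 PE splits), whereas enumerating every (tile, PE split) pair over the same ranges would evaluate 5,488. The trade-off is that a sequential decomposition can miss the joint optimum when the two costs interact strongly.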

A practical tile size selection model for affine loop nests

Proceedings of the ACM International Conference on Supercomputing, 2021. DOI: 10.1145/3447818.3462213

Author(s): Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula

Keyword(s): Selection Model, Size Selection, Tile Size, Loop Nests

AutoParallel: Automatic parallelisation and distributed execution of affine loop nests in Python

The International Journal of High Performance Computing Applications, Vol 34 (6), pp. 659-675, 2020. DOI: 10.1177/1094342020937050

Author(s): Cristian Ramon-Cortes, Ramon Amela, Jorge Ejarque, Philippe Clauss, Rosa M. Badia

Keyword(s): Programming Languages, Scale Up, Distributed Applications, Loop Nests, Sequential Programming, Distributed Execution, One Step, The One, Blocked Algorithms, Distributed And Parallel Computing

Recent improvements in programming languages and models have focused on simplicity and abstraction, which has helped bring Python to the top of the list of programming languages. However, there is still room for improvement in shielding users from dealing directly with distributed and parallel computing issues. This paper proposes and evaluates AutoParallel, a Python module that automatically finds an appropriate task-based parallelisation of affine loop nests and executes them in parallel on a distributed computing infrastructure. It is based on sequential programming and requires a single annotation (a Python decorator), so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. The evaluation demonstrates that AutoParallel goes one step further in easing the development of distributed applications. On the one hand, the programmability evaluation highlights the benefits of using a single Python decorator instead of manually annotating each task and its parameters or, even worse, having to develop the parallel code explicitly (e.g., using OpenMP or MPI). On the other hand, the performance evaluation demonstrates that AutoParallel can automatically generate task-based workflows from sequential Python code while achieving the same performance as manually taskified versions of established state-of-the-art algorithms (i.e., Cholesky, LU, and QR decompositions). Finally, AutoParallel can also automatically build data blocks to increase task granularity, freeing the user from creating the data chunks and redesigning the algorithm. For advanced users, we believe this feature can be useful as a baseline for designing blocked algorithms.
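The kind of affine loop nest this approach targets can be illustrated with a blocked matrix multiplication: the (i, j) iterations write disjoint blocks of C and can run as independent tasks, while the k loop accumulates into one block and must stay sequential. The sketch below hand-writes that task decomposition with Python's standard `concurrent.futures`; it is not AutoParallel's decorator or its generated code, and the helper names (`split_blocks`, `c_block_task`, `parallel_blocked_matmul`) are hypothetical. Threads are used only to expose the task structure (CPython threads will not speed up this pure-Python arithmetic); a distributed task runtime would take their place in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(mat, bs):
    """Split an n x n matrix (list of lists, n divisible by bs) into a grid of bs x bs blocks."""
    n = len(mat)
    return [[[row[j:j + bs] for row in mat[i:i + bs]]
             for j in range(0, n, bs)]
            for i in range(0, n, bs)]


def join_blocks(blocks):
    """Inverse of split_blocks: reassemble a grid of blocks into one matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for blk in block_row for x in blk[r]])
    return rows


def c_block_task(A, B, i, j):
    """One task: C[i][j] = sum over k of A[i][k] @ B[k][j].

    The k loop accumulates into the same C block (a loop-carried dependence),
    so it stays sequential inside the task."""
    bs = len(A[i][0])
    c = [[0] * bs for _ in range(bs)]
    for k in range(len(B)):
        a, b = A[i][k], B[k][j]
        for r in range(bs):
            for s in range(bs):
                c[r][s] += sum(a[r][t] * b[t][s] for t in range(bs))
    return c


def parallel_blocked_matmul(a, b, bs, workers=4):
    A, B = split_blocks(a, bs), split_blocks(b, bs)
    nb = len(A)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Iterations (i, j) write disjoint C blocks, so each one is an independent task.
        futures = [[pool.submit(c_block_task, A, B, i, j) for j in range(nb)]
                   for i in range(nb)]
        C = [[f.result() for f in row] for row in futures]
    return join_blocks(C)
```

The block size `bs` plays the role of the task granularity: larger blocks mean fewer, heavier tasks.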

Reconstruction of Multi-Dimensional Form of Linearized Accesses to Arrays in SAPFOR

Russian Digital Libraries Journal, Vol 23 (4), pp. 770-787, 2020. DOI: 10.26907/1562-5419-2020-23-4-770-787

Author(s): Nikita Andreevich Kataev, Vladislav Nikolaevich Vasilkin

Keyword(s): Programming Languages, Program Analysis, Dimensional Structure, Loop Nests, Data Dependence Analysis, C Programming, Program Parallelization, Hard Problems, Automate Program

The automated parallelization system SAPFOR (System FOR Automated Parallelization) includes tools for program analysis and transformation. The main goal of the system is to reduce the complexity of program parallelization. SAPFOR focuses on multilingual applications written in Fortran and C. It uses the low-level LLVM IR representation for program analysis. This representation allows various IR-level optimizations to be applied to improve the quality of program analysis. At the same time, it loses some features of the program that are available in its higher-level representation. One of these features is the multi-dimensional structure of arrays. Data dependence analysis is one of the main problems that must be solved to automate program parallelization; moreover, such an analysis belongs to the class of NP-hard problems. Knowing the multi-dimensional structure of arrays often makes it possible to exploit the structure of the subscript expressions in array accesses and to reduce the complexity of the analysis. In addition, multi-dimensional arrays make it possible to use a multi-dimensional processor matrix and to parallelize whole loop nests rather than a single loop in the nest, which increases the parallelism of the program. These capabilities are natively supported in the DVM system. This paper discusses the approach used in SAPFOR to recover the multi-dimensional form of arrays from their linearized representation in LLVM IR. The proposed approach has been successfully evaluated on various applications, including performance tests from the NAS Parallel Benchmarks suite.
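The core idea of recovering a multi-dimensional form from a linearized access can be sketched as follows: sort the loop indices by their coefficient in the flat subscript, derive each inner dimension's size as the ratio of consecutive strides, and reject the split if an index could spill into the next dimension. This is a minimal sketch, assuming constant strides, zero-based indices, and one loop index per dimension; real LLVM IR subscripts usually carry symbolic sizes such as `i*N + j`, which it does not handle. `delinearize` and `flat_to_multi` are hypothetical helpers, not SAPFOR's API.

```python
def delinearize(strides, extents):
    """Recover a multi-dimensional subscript from a flat affine subscript.

    strides: loop index -> coefficient in the flat subscript,
             e.g. {"i": 200, "j": 20, "k": 1} for A[i*200 + j*20 + k]
    extents: loop index -> trip count (the index runs over 0..extent-1)
    Returns (order, shape): indices outermost-first and the size of each
    dimension (None for the outermost one, which the access does not bound).
    Raises ValueError if the flat subscript cannot be split this way.
    """
    order = sorted(strides, key=strides.get, reverse=True)
    if strides[order[-1]] != 1:
        raise ValueError("innermost stride must be 1 in this sketch")
    shape = [None]
    for outer, inner in zip(order, order[1:]):
        size, rem = divmod(strides[outer], strides[inner])
        if rem != 0:
            raise ValueError(f"stride of {outer} is not a multiple of stride of {inner}")
        if extents[inner] > size:
            raise ValueError(f"{inner} can overflow into the next dimension")
        shape.append(size)
    return order, shape


def flat_to_multi(flat, shape):
    """Split a flat offset into subscripts for a row-major array of the given shape."""
    subs = []
    for size in reversed(shape[1:]):
        flat, r = divmod(flat, size)
        subs.append(r)
    subs.append(flat)
    return tuple(reversed(subs))
```

For `A[i*200 + j*20 + k]` with `j < 10` and `k < 20`, this yields the access `A[i][j][k]` on an array of shape `[?][10][20]`, which a dependence test can then treat dimension by dimension.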

Study of the vectorization efficiency of loop nests with an irregular number of iterations

Program systems theory and applications, Vol 10 (4), pp. 77-96, 2019. DOI: 10.25209/2079-3316-2019-10-4-77-96

Author(s): Алексей Анатольевич Рыбаков, Сергей Сергеевич Шумилин

Keyword(s): Loop Nests, Number Of Iterations

Vectorization is an important low-level optimization used to produce highly efficient parallel code. Features of the AVX-512 instruction set make it possible to apply vectorization to complex program contexts, in particular to loop nests and to loops with heavily branching control flow. When vector instructions are used in a context with an unknown execution profile, there is a risk of low vectorization efficiency. This is especially pronounced when vectorizing loop nests with an irregular number of inner-loop iterations. The article considers a practical approach to vectorizing loop nests based on a predicated representation of the program. As an example, it presents an implementation of Shell sort, whose compact form consists of a loop nest in which the number of inner-loop iterations is irregular and depends on the iteration numbers of the outer loops. Such a context is extremely inconvenient for vectorization. The article compares the theoretical and practical efficiency of vectorizing Shell sort, examines the features of this program context, and explains their negative impact on the performance of the vectorized code. The results can be used by researchers and software developers to identify the causes of low vectorization efficiency in code with similar features.
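Why an irregular inner trip count hurts predicated vectorization can be seen by simulating mask-based lanes. This is a minimal sketch, not the authors' AVX-512 code: Python cannot issue vector instructions, so `shell_sort_masked` (a hypothetical helper) runs groups of consecutive outer iterations in lock-step "lanes" with a per-lane mask and counts how many issued lane-steps do useful work. The gap sequence (Shell's original n/2, n/4, ..., 1), the vector width, and the efficiency metric are assumptions for illustration.

```python
def shell_sort_masked(data, width=8):
    """Shell sort whose outer loop runs `width` iterations at a time in lock-step
    "lanes" under a mask, imitating predicated SIMD execution.

    Returns (sorted_list, efficiency), where efficiency is the fraction of issued
    lane-steps in which a lane was still active."""
    a = list(data)
    n = len(a)
    gaps = []
    g = n // 2
    while g > 0:          # Shell's original gap sequence n/2, n/4, ..., 1
        gaps.append(g)
        g //= 2
    active = issued = 0
    for gap in gaps:
        # Consecutive outer iterations belong to different gap-chains only if
        # there are at most `gap` of them, so narrow the vector for small gaps.
        lanes = min(width, gap)
        for base in range(gap, n, lanes):
            idx = list(range(base, min(base + lanes, n)))
            tmp = [a[i] for i in idx]      # element each lane is inserting
            pos = list(idx)                # current hole position per lane
            mask = [True] * len(idx)       # per-lane predicate
            while any(mask):
                issued += lanes            # one masked vector step over the full width
                for l in range(len(idx)):
                    if not mask[l]:
                        continue
                    active += 1
                    j = pos[l]
                    if j >= gap and a[j - gap] > tmp[l]:
                        a[j] = a[j - gap]  # shift the larger element up its chain
                        pos[l] = j - gap
                    else:
                        mask[l] = False    # this lane's inner loop has finished
            for l in range(len(idx)):
                a[pos[l]] = tmp[l]
    return a, (active / issued if issued else 1.0)
```

Each vector step lasts until the slowest lane finishes, so lanes whose inner loop exits early sit idle under the mask; with `width=1` the efficiency is exactly 1.0, and any data-dependent spread in trip counts (or a partial final group) pushes it below 1.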

Towards an Achievable Performance for the Loop Nests

Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, pp. 70-77, 2019. DOI: 10.1007/978-3-030-34627-0_6

Author(s): Aniket Shivam, Neftali Watkinson, Alexandru Nicolau, David Padua, Alexander V. Veidenbaum

Keyword(s): Loop Nests

Architecture and Synthesis for Area-Efficient Pipelining of Irregular Loop Nests

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol 36 (11), pp. 1817-1830, 2017. DOI: 10.1109/tcad.2017.2664067. Cited by ~3

Author(s): Gai Liu, Mingxing Tan, Steve Dai, Ritchie Zhao, Zhiru Zhang

Keyword(s): Loop Nests, Area Efficient

Construction of optimal space-time mappings for automatic parallelization of loop nests with static control flow

2017 IEEE 11th International Conference on Application of Information and Communication Technologies (AICT), 2017. DOI: 10.1109/icaict.2017.8686931

Author(s): Artem S. Lebedev

Keyword(s): Automatic Parallelization, Space Time, Control Flow, Loop Nests, Optimal Space

Organizing communication of parallel processes during automatic parallelization of loop nests with static control flow for cluster systems using polyhedral model

Program systems theory and applications, Vol 8 (4), pp. 3-20, 2017. DOI: 10.25209/2079-3316-2017-8-4-3-20. Cited by ~1

Author(s): Artem Lebedev

Keyword(s): Automatic Parallelization, Control Flow, Polyhedral Model, Cluster Systems, Parallel Processes, Loop Nests

Tiling arbitrarily nested loops by means of the transitive

International Journal of Applied Mathematics and Computer Science, Vol 26 (4), pp. 919-939, 2016. DOI: 10.1515/amcs-2016-0065. Cited by ~8

Author(s): Włodzimierz Bielecki, Marek Pałkowski

Keyword(s): Transitive Closure, Lexicographic Order, Dependence Graph, Program Transformations, Affine Transformations, Loop Nest, Loop Nests, Novel Approach, Nested Loops, Affine Functions

A novel approach to generating tiled code for arbitrarily nested loops is presented. It is derived by combining the polyhedral and iteration space slicing frameworks. Instead of program transformations represented by a set of affine functions, one per statement, it uses the transitive closure of a loop nest dependence graph to correct the original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of the target tiles. Parallel tiled code can be generated from valid serial tiled code by applying affine transformations or transitive closure, taking as input an inter-tile dependence graph whose vertices are the target tiles and whose edges connect dependent target tiles. We demonstrate how a relation describing such a graph can be formed. The main merit of the presented approach compared with well-known ones is that it does not require full permutability of loops to generate both serial and parallel tiled code; this increases the scope of loop nests that can be tiled.
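The role of transitive closure in repairing rectangular tiles can be illustrated on a small, explicitly enumerated iteration space. This is a simplified sketch of the general idea, not the paper's tile-correction formulas, which work on symbolic relations: `corrected_tiles` moves every iteration into the lexicographically latest tile among itself and all of its transitive predecessors (R+), which guarantees that no dependence points to an earlier tile. All helper names and the example loop nest are hypothetical.

```python
def transitive_closure(nodes, edges):
    """R+: map each iteration to every iteration reachable through one or more dependences."""
    succ = {v: [] for v in nodes}
    for p, q in edges:
        succ[p].append(q)
    closure = {}
    for v in nodes:
        seen, stack = set(), list(succ[v])
        while stack:
            w = stack.pop()
            if w not in seen:
                seen.add(w)
                stack.extend(succ[w])
        closure[v] = seen
    return closure


def rectangular_tiles(nodes, size):
    """Original rectangular tiles: tile identifier = per-dimension floor division."""
    return {v: tuple(c // s for c, s in zip(v, size)) for v in nodes}


def corrected_tiles(nodes, edges, size):
    """Assign each iteration the lexicographically largest original tile among itself
    and its transitive predecessors (Python tuple comparison is lexicographic)."""
    tile = rectangular_tiles(nodes, size)
    new = dict(tile)
    for p, reachable in transitive_closure(nodes, edges).items():
        for q in reachable:
            if tile[p] > new[q]:
                new[q] = tile[p]
    return new


def respects_dependences(edges, tile):
    """Serial tiled code is valid if no dependence goes to a lexicographically earlier tile."""
    return all(tile[p] <= tile[q] for p, q in edges)


# Example: a[i][j] = f(a[i-1][j+1]) gives the dependence (i, j) -> (i+1, j-1),
# so the loops are not fully permutable and plain rectangular tiles are invalid.
N = 6
nodes = [(i, j) for i in range(N) for j in range(N)]
edges = [((i, j), (i + 1, j - 1)) for i in range(N - 1) for j in range(1, N)]
```

Executing tiles in lexicographic order and, within each tile, iterations in their original order then preserves every dependence. The price of this naive correction is that tiles can become unbalanced, since iterations pile up in later tiles.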
