loop nests
Recently Published Documents


TOTAL DOCUMENTS

85
(FIVE YEARS 6)

H-INDEX

15
(FIVE YEARS 0)

2022 ◽  
Vol 19 (1) ◽  
pp. 1-26
Author(s):  
Prasanth Chatarasi ◽  
Hyoukjun Kwon ◽  
Angshuman Parashar ◽  
Michael Pellauer ◽  
Tushar Krishna ◽  
...  

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules . The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel , and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers.


Author(s):  
Kumudha Narasimhan ◽  
Aravind Acharya ◽  
Abhinav Baid ◽  
Uday Bondhugula

Author(s):  
Cristian Ramon-Cortes ◽  
Ramon Amela ◽  
Jorge Ejarque ◽  
Philippe Clauss ◽  
Rosa M. Badia

The last improvements in programming languages and models have focused on simplicity and abstraction; leading Python to the top of the list of the programming languages. However, there is still room for improvement when preventing users from dealing directly with distributed and parallel computing issues. This paper proposes and evaluates AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine loop nests and execute them in parallel in a distributed computing infrastructure. It is based on sequential programming and contains one single annotation (in the form of a Python decorator) so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. The evaluation demonstrates that AutoParallel goes one step further in easing the development of distributed applications. On the one hand, the programmability evaluation highlights the benefits of using a single Python decorator instead of manually annotating each task and its parameters or, even worse, having to develop the parallel code explicitly (e.g., using OpenMP, MPI). On the other hand, the performance evaluation demonstrates that AutoParallel is capable of automatically generating task-based workflows from sequential Python code while achieving the same performances than manually taskified versions of established state-of-the-art algorithms (i.e., Cholesky, LU, and QR decompositions). Finally, AutoParallel is also capable of automatically building data blocks to increase the tasks’ granularity; freeing the user from creating the data chunks, and re-designing the algorithm. For advanced users, we believe that this feature can be useful as a baseline to design blocked algorithms.


2020 ◽  
Vol 23 (4) ◽  
pp. 770-787
Author(s):  
Nikita Andreevich Kataev ◽  
Vladislav Nikolaevich Vasilkin

The system for automated parallelization SAPFOR (System FOR Automated Parallelization) includes tools for program analysis and transformation. The main goal of the system is to reduce the complexity of program parallelization. SAPFOR system is focused on the investigation of multilingual applications in Fortran and C programming languages. The low-level LLVM IR representation is used in SAPFOR for program analysis. This representation allows us to perform various IR-level optimizations to improve the quality of program analysis. At the same time, it loses some features of the program, which are available in its higher level representation. One of these features is the multi-dimensional structure of the arrays. Data dependence analysis is one of the main problems which should be solved to automate program parallelization. Moreover, such an analysis belongs to the class of NP-hard problems. Knowledge of the multidimensional structure of arrays allows in many cases to take into account the structure of index expressions in calls to arrays and reduce the complexity of the analysis. In addition, the use of multi-dimensional arrays allows us to use multi-dimensional processor matrix and to parallelize a whole loop nests, rather than a single loop in the nest. So, parallelism of a program is going to be increased. These opportunities are natively supported in the DVM system. This paper discusses the approach used in the SAPFOR system to recover the form of multi-dimensional arrays by their linearized representation in LLVM IR. The proposed approach has been successfully evaluated on various applications including performance tests from the NAS Parallel Benchmarks suite.


2019 ◽  
Vol 10 (4) ◽  
pp. 77-96
Author(s):  
Алексей Анатольевич Рыбаков ◽  
Сергей Сергеевич Шумилин

Векторизация вычислений является важной низкоуровневой оптимизацией, используемой для создания высокоэффективного параллельного кода. Особенности набора инструкций AVX-512 позволяют применять векторизацию для сложного программного контекста, в частности для гнезд циклов и циклов с сильно разветвленным управлением. При использовании векторных инструкций для контекста с неизвестным профилем исполнения существует опасность низкой эффективности векторизации. Особенно ярко это проявляется при векторизации гнезд циклов с нерегулярным числом итераций внутреннего цикла. В статье рассматривается практический подход к векторизации гнезд циклов, основанный на предикатном представлении программы. В качестве примера приводится реализация сортировки Шелла, компактная реализация которой состоит из гнезда циклов, в котором количество итераций внутреннего цикла носит нерегулярный характер и зависит от номеров итераций внешних циклов. Такой контекст является крайне неудобным для векторизации. Приводится сравнение теоретической и практической эффективности векторизации сортировки Шелла, рассматриваются особенности этого программного контекста и объясняется их негативное влияние на производительность векторизованного кода. Полученные результаты могут быть использованы исследователями и разработчиками программного обеспечения для обнаружения причин низкой эффективности векторизации программного кода с похожими особенностями.


Author(s):  
Aniket Shivam ◽  
Neftali Watkinson ◽  
Alexandru Nicolau ◽  
David Padua ◽  
Alexander V. Veidenbaum
Keyword(s):  

Author(s):  
Gai Liu ◽  
Mingxing Tan ◽  
Steve Dai ◽  
Ritchie Zhao ◽  
Zhiru Zhang
Keyword(s):  

2016 ◽  
Vol 26 (4) ◽  
pp. 919-939 ◽  
Author(s):  
Włodzimierz Bielecki ◽  
Marek Pałkowski

Abstract A novel approach to generation of tiled code for arbitrarily nested loops is presented. It is derived via a combination of the polyhedral and iteration space slicing frameworks. Instead of program transformations represented by a set of affine functions, one for each statement, it uses the transitive closure of a loop nest dependence graph to carry out corrections of original rectangular tiles so that all dependences of the original loop nest are preserved under the lexicographic order of target tiles. Parallel tiled code can be generated on the basis of valid serial tiled code by means of applying affine transformations or transitive closure using on input an inter-tile dependence graph whose vertices are represented by target tiles while edges connect dependent target tiles. We demonstrate how a relation describing such a graph can be formed. The main merit of the presented approach in comparison with the well-known ones is that it does not require full permutability of loops to generate both serial and parallel tiled codes; this increases the scope of loop nests to be tiled.


Sign in / Sign up

Export Citation Format

Share Document