loop transformation Latest Research Papers

Stencil computation optimizations have been investigated quite a lot, and various approaches have been proposed. Loop transformation is a vital kind of optimization in modern production compilers and has proved successful employment within compilers. In this paper, we combine the two aspects to study the potential benefits some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balance, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms, demonstrate the respective effects of the transformation recipes. An average speedup of 1.65× is obtained, and the best is 1.88× for the single transformation recipes we analyze. The compound recipes demonstrate a maximum speedup of 1.92×.

Download Full-text

Reducing Memory Access Conflicts with Loop Transformation and Data Reuse on Coarse-grained Reconfigurable Architecture

10.23919/date51398.2021.9473971 ◽

2021 ◽

Author(s):

Yuge Chen ◽

Zhongyuan Zhao ◽

Jianfei Jiang ◽

Guanghui He ◽

Zhigang Mao ◽

...

Keyword(s):

Data Reuse ◽

Reconfigurable Architecture ◽

Memory Access ◽

Coarse Grained ◽

Loop Transformation

Download Full-text

Effect of Cr Concentration on ½ to Dislocation Loop Transformation in Fe-Cr alloys

Journal of Nuclear Materials ◽

10.1016/j.jnucmat.2020.152592 ◽

2021 ◽

Vol 543 ◽

pp. 152592

Author(s):

Yaxuan Zhang ◽

Ziqi Xiao ◽

Xian-Ming Bai

Keyword(s):

Dislocation Loop ◽

Loop Transformation ◽

Cr Concentration

Download Full-text

Accuracy of view factor calculations for digital terrain models of comets and asteroids

Astronomy and Astrophysics ◽

10.1051/0004-6361/202038462 ◽

2020 ◽

Vol 642 ◽

pp. A167

Author(s):

L. Rezac ◽

Y. Zhao

Keyword(s):

Thermal Evolution ◽

View Factor ◽

Loop Transformation ◽

Practical Applications ◽

Critical Elements ◽

Digital Terrain ◽

Average Accuracy ◽

Order Of Magnitude ◽

View Factors ◽

Shared Edge

Context. Detailed shape and topographic models coupled with sophisticated thermal physics are critical elements to proper characterization of surfaces of small bodies in our solar system. Calculations of self-heating effects are especially important in the context of thermal evolution of non-convex surfaces, including craters, cracks, or openings between “rocks”. Aims. Our aim is to provide quantitative comparisons of multiple numerical methods for computing view factors for concave geometries and provide a more rigorous criteria for the validity of their application. Methods. We contrasted five methods of estimating the view factors. First, we studied specific geometries, including shared-edge facets for a reduced two-facet problem. Then, we applied these methods to the shape model of 67P/Churyumov-Gerasimenko. Nevertheless, the presented results are general and could be extended to shape models of other bodies as well. Results. The close loop transformation of the double area integration method for evaluating view factors of nearby or shared-edge facets is the most accurate, although computationally expensive. Two methods of facet subdivision we evaluate in this work provide reasonably accurate results for modest facet subdivision numbers, however, may result in a degraded performance for specific facet geometries. Increasing the number of subdivisions improves their accuracy, but also increases their computational burden. In practical applications, a trade-off between accuracy and computational speed has to be found, therefore, we propose a combined method based on a simple metric that incorporates a conditional application of various methods and an adaptive number of subdivisions. In our study case of a pit on 67P/CG, this method can reach average accuracy of 2–3% while being about an order of magnitude faster than the (most accurate) line integral method.

Download Full-text

Refactoring the Memory Access Pattern to Improve Computational Performance in NEMO

10.5194/egusphere-egu2020-9732 ◽

2020 ◽

Author(s):

Italo Epicoco ◽

Francesca Mele ◽

Silvia Mocavero ◽

Marco Chiarelli ◽

Alessandro D'Anca ◽

...

Keyword(s):

Memory Performance ◽

Optimization Technique ◽

Main Memory ◽

The Body ◽

Cache Memory ◽

Loop Transformation ◽

Data Dependencies ◽

The Third ◽

Computational Performance ◽

Loop Fusion

In the roadmap of modern parallel architectures development, the computing power of a node grows much more quickly than main memory performance (capacity, bandwidth). This leads to an even much higher gap between computing and memory resources. An efficient use of the cache memory is becoming ever more essential as optimization technique. The NEMO model uses a finite difference integration method and a regular cartesian grid for space discretization. The NEMO code reflects this choice: a generic field is represented in memory as a 3D array; and the code is mainly composed of three-level nested loops. These loops often include only a few operations in the body; the results are stored in a temporary 3D array and then used in subsequent loops until the final calculation. The aim of this work is to make better use of the cache memory by fusing DO loops together. The loop fusion is a transformation which takes two or more adjacent loops that have the same iteration space traversal and combines their bodies into a single loop. The fusion of the loops is not trivial, and it could require introducing additional redundant operations to solve data dependencies. Unfortunately, this leads to a drawback of the overall performance. To avoid the redundant operation, we can adopt pointers to arrays and implement a pointer rotation at each loop iteration. We have developed the loop fusion transformation in an advection kernel extracted from the NEMO oceanic model. We have compared 3 different versions of the optimized advection kernel, with 3 different levels of loop fusion. The first prototype refers to the implementation where the extreme fusion is applied, and all loops in the routine have been fused. In this version, the operations are replicated up to 3 times. In the second prototype the buffer rotation has been applied only in the outermost loop. In the third prototype, the buffer rotation has also been implemented for the second dimension, and this version introduces only a limited amount of redundant operations.The tests have been performed on the Athena cluster located at the CMCC supercomputing center. The supercomputing infrastructure is based on the Intel Xeon E5-2670 processors. The memory hierarchy is composed of 32KB of L1 cache, 256KB of L2 and 20MB L3 cache shared among the cores. The results clearly proved the effectiveness of the loop fusion approach that reaches a speedup of 2x with a high number of cores. The third prototype has proven to be the most promising solution. Prototypes 1 and 2 provide a good improvement up to 256 cores then the redundant operations lead to a loss of performance. A deeper analysis measuring the Last Level Cache misses also showed how the loop transformation significantly reduced the number of cache misses. Despite the good results achieved with the loop fusion optimization, we can remark that this optimization is strictly linked to the computing architecture. A fully portable performance improvement can be ensured by the adoption of a DSL (Domain Specific Language).

Download Full-text

Loop Transformation Algorithm for Test Vector Accessing at High Speed

ISTFA 2019: Conference Proceedings from the 45th International Symposium for Testing and Failure Analysis ◽

10.31399/asm.cp.istfa2019p0434 ◽

2019 ◽

Author(s):

Bjorn Dahlberg ◽

Martin Versen

Keyword(s):

Failure Analysis ◽

High Speed ◽

Semiconductor Devices ◽

Test Vector ◽

Memory Architecture ◽

Critical Importance ◽

Loop Transformation ◽

Test Speed ◽

Transformation Algorithm ◽

Test Vectors

Abstract Looping on test vectors is a widespread requirement in failure analysis of semiconductor devices. The start of the loop and the number of vectors in the loop can be of critical importance. Present-day vector memory architecture tends to impose restrictions on both due to test speed requirements. A new Vector Loop Transformation algorithm is introduced to remedy the tester constraints.

Download Full-text