Técnicas de otimização em Aceleradores Vetoriais NEC SX-Aurora [Optimization Techniques on NEC SX-Aurora Vector Accelerators]

10.5753/eradrs.2021.14792 ◽

2021 ◽

Author(s):

Félix Michels ◽

Matheus Serpa ◽

Danilo Carastan-Santos ◽

Lucas Schnorr ◽

Phillipe Navaux

Keyword(s):

Loop Unrolling

This work evaluates classic optimization techniques on the new NEC SX-Aurora architecture. The case studies are the NAS benchmark and a real-world seismic migration application used by the oil and gas industry. The final experimental results show that loop unrolling and inlining improve performance, measured in FLOPS, by up to 7.8× on the NAS benchmark and by up to 1.9× on the seismic migration application, compared with the original versions.
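As a rough illustration of the two techniques named in this abstract, the sketch below applies manual loop unrolling and an inlinable helper to a dot product. It is not code from the paper: the names `mul`, `dot_simple`, and `dot_unrolled4` are hypothetical, the unroll factor of 4 is arbitrary, and both functions assume `y` has at least as many elements as `x`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: marked inline so the compiler may fold it into the
// loop body instead of emitting one call per element.
inline double mul(double a, double b) { return a * b; }

// Baseline: one multiply-add per loop iteration.
// Assumes y.size() >= x.size().
double dot_simple(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += mul(x[i], y[i]);
    }
    return sum;
}

// Manually unrolled by 4, with four independent accumulators so the
// additions do not form a single dependency chain.
// Assumes y.size() >= x.size().
double dot_unrolled4(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += mul(x[i],     y[i]);
        s1 += mul(x[i + 1], y[i + 1]);
        s2 += mul(x[i + 2], y[i + 2]);
        s3 += mul(x[i + 3], y[i + 3]);
    }
    double sum = (s0 + s1) + (s2 + s3);
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i) {
        sum += mul(x[i], y[i]);
    }
    return sum;
}
```

Because the unrolled version reorders floating-point additions, its result can differ from the baseline in the last bits for general inputs. Whether it is actually faster depends on the compiler and target, which is the kind of question the abstract evaluates experimentally.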

Effects of Loop Unrolling and Loop Fusion on Register Pressure and Code Performance.

10.21236/ada326916 ◽

1997 ◽

Cited By ~ 1

Author(s):

Dale Shires

Keyword(s):

Loop Unrolling ◽

Code Performance ◽

Loop Fusion ◽

Post-pass periodic register allocation to minimise loop unrolling degree

ACM SIGPLAN Notices ◽

10.1145/1379023.1375677 ◽

2008 ◽

Vol 43 (7) ◽

pp. 141-150

Author(s):

Mounira Bachir ◽

Sid-Ahmed-Ali Touati ◽

Albert Cohen

Keyword(s):

Loop Unrolling

Loop unrolling effect on parallel code optimization

Proceedings of the 2nd International Conference on Future Networks and Distributed Systems - ICFNDS '18 ◽

10.1145/3231053.3231060 ◽

2018 ◽

Author(s):

Karim Soliman ◽

Marwa El Shenawy ◽

Ahmed Abou El Farag

Keyword(s):

Code Optimization ◽

Loop Unrolling ◽

Parallel Code

Research of Register Pressure Aware Loop Unrolling Optimizations for Compiler

MATEC Web of Conferences ◽

10.1051/matecconf/201822803008 ◽

2018 ◽

Vol 228 ◽

pp. 03008

Author(s):

Xuehua Liu ◽

Liping Ding ◽

Yanfeng Li ◽

Guangxuan Chen ◽

Jin Du

Keyword(s):

Finite Number ◽

Infinite Number ◽

Performance Degradation ◽

Transformation Process ◽

Fine Grained ◽

Loop Unrolling ◽

Average Improvement ◽

Linpack Benchmark ◽

Loop Optimizations

Register pressure is a well-known problem for compilers because of the mismatch between the unlimited number of pseudo-registers and the finite number of hard registers. Excessive register pressure can cause register spilling, which in turn degrades performance. Many compiler optimizations, loop optimizations in particular, suffer from register spilling. To reduce register pressure and thereby improve the compiler's effectiveness, this research takes register pressure into account when applying loop unrolling during the transformation process. In addition, a register-pressure-aware transformation can reduce the performance overhead of some fine-grained randomization transformations that can be used to defend against ROP attacks. Experiments showed a peak improvement of about 3.6% and an average improvement of about 1% for the SPEC CPU 2006 benchmarks, and a peak improvement of about 3% and an average improvement of about 1% for the LINPACK benchmark.
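The general idea of letting register pressure limit the unroll factor can be sketched as a toy heuristic. This is not the algorithm from the paper: the function `choose_unroll_factor`, its parameters, and the cost model (each unrolled copy of the body needs a fixed number of registers on top of the loop-invariant ones) are all simplifying assumptions made for illustration.

```cpp
// Toy heuristic (hypothetical, not the paper's method): pick the largest
// power-of-two unroll factor whose estimated register demand still fits
// in the available hard registers.
//
// regs_per_iter   : registers assumed live per unrolled copy of the body
// invariant_regs  : registers assumed held by loop-invariant values
// hard_regs       : number of hard registers available
// max_factor      : upper bound on the unroll factor to consider
int choose_unroll_factor(int regs_per_iter, int invariant_regs,
                         int hard_regs, int max_factor) {
    int best = 1;  // factor 1 means "do not unroll"
    for (int u = 2; u <= max_factor; u *= 2) {
        // Estimated demand grows linearly with the unroll factor.
        if (invariant_regs + u * regs_per_iter > hard_regs) {
            break;  // unrolling further would be expected to spill
        }
        best = u;
    }
    return best;
}
```

For example, with 3 registers per copy, 4 invariant registers, and 16 hard registers, the estimate allows a factor of 4 (4 + 3·4 = 16) but not 8. A real compiler would derive these quantities from liveness analysis rather than take them as inputs.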

A New Vectorization Technique for Expression Templates in C++

American Journal of Undergraduate Research ◽

10.33697/ajur.2012.003 ◽

2012 ◽

Vol 10 (4) ◽

Author(s):

J Progsch ◽

Y Ineichen ◽

A Adelmann

Keyword(s):

High Performance ◽

Performance Gap ◽

New Approach ◽

Loop Unrolling ◽

Basic Linear Algebra Subprograms ◽

Template Library ◽

Abstract Interface ◽

Expression Templates ◽

Performance Computing ◽

Expression Template

Vector operations play an important role in high-performance computing and are typically provided by highly optimized libraries that implement the Basic Linear Algebra Subprograms (BLAS) interface. In C++, templates and operator overloading allow these vector operations to be implemented as expression templates, which construct custom loops at compile time and provide a more abstract interface. Unfortunately, existing expression template libraries lack the performance of fast BLAS implementations. This paper presents a new approach, Statically Accelerated Loop Templates (SALT), that closes this performance gap by combining expression templates with an aggressive loop unrolling technique. Benchmarks were conducted using the Intel C++ compiler and the GNU Compiler Collection to assess the performance of the library relative to Intel's Math Kernel Library as well as the Eigen template library. The results show that the approach provides performance comparable to the fastest available BLAS implementations while retaining the convenience and flexibility of a template library.
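For readers unfamiliar with expression templates, the following minimal sketch shows the basic mechanism the abstract builds on: `a + b + c` produces a lightweight expression object, and a single loop runs only on assignment. It is not SALT and does not include its unrolling technique; the namespace `et_sketch` and the types `Expr`, `Sum`, and `Vec` are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

namespace et_sketch {

// Tag base so operator+ below only applies to our own expression types.
struct Expr {};

// Node representing lhs + rhs; nothing is computed until operator[] is called.
template <typename L, typename R>
struct Sum : Expr {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec : Expr {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression evaluates it element by element in one loop,
    // without temporary vectors. Assumes expr.size() >= size().
    template <typename E,
              typename = typename std::enable_if<
                  std::is_base_of<Expr, E>::value>::type>
    Vec& operator=(const E& expr) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = expr[i];
        }
        return *this;
    }
};

// Builds the expression tree; restricted to types derived from Expr.
template <typename L, typename R,
          typename = typename std::enable_if<
              std::is_base_of<Expr, L>::value &&
              std::is_base_of<Expr, R>::value>::type>
Sum<L, R> operator+(const L& l, const R& r) {
    return Sum<L, R>(l, r);
}

}  // namespace et_sketch
```

Given `et_sketch::Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), r(4);`, the statement `r = a + b + c;` fills `r` in a single pass. Because `Sum` stores references, an expression should be assigned within the same statement that builds it; keeping it in an `auto` variable would leave dangling references to temporaries.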

Optimization of Advanced Encryption Standard (AES) Using Vivado High Level Synthesis (HLS)

10.29007/x3tx ◽

2019 ◽

Author(s):

Luka Daoud ◽

Fady Hussein ◽

Nader Rafla

Keyword(s):

High Level Synthesis ◽

Advanced Encryption Standard ◽

Data Confidentiality ◽

Maximum Throughput ◽

Loop Unrolling ◽

Aes Algorithm ◽

Hardware Implementations ◽

High Level ◽

Dedicated Hardware ◽

Vivado Hls

The Advanced Encryption Standard (AES) is a fundamental building block of many network security protocols, ensuring data confidentiality in applications ranging from data servers to low-power embedded hardware systems. To optimize such hardware implementations, High-Level Synthesis (HLS) provides design flexibility and rapid optimization of dedicated hardware to meet design constraints. In this paper, we present the implementation of an AES encryption processor on an FPGA using Xilinx Vivado HLS. The AES architecture was analyzed and designed using loop unrolling together with inner-round and outer-round pipelining to achieve a maximum AES throughput of up to 1290 Mbps (megabits per second) while using only 3.24% of the FPGA's slices, achieving 3 Mbps per slice area.

Solving 3D Time-Fractional Diffusion Equations by High-Performance Parallel Computing

Fractional Calculus and Applied Analysis ◽

10.1515/fca-2016-0008 ◽

2016 ◽

Vol 19 (1) ◽

Cited By ~ 2

Author(s):

Wei Zhang ◽

Xing Cai

Keyword(s):

High Performance ◽

Time Integration ◽

Computation Time ◽

Fractional Diffusion ◽

Memory Storage ◽

Diffusion Equations ◽

Fractional Diffusion Equations ◽

Loop Unrolling ◽

Programming Techniques ◽

Three Space

Numerically solving time-fractional diffusion equations, especially in three space dimensions, is a daunting computational task. This is due to the huge requirements of both computation time and memory storage. Compared with solving integer-ordered diffusion equations, the costs for time and storage both increase by a factor that equals the number of time steps involved. Aiming to overcome these two obstacles, we study in this paper three programming techniques: loop unrolling, vectorization and parallelization. For a representative numerical scheme that adopts finite differencing and explicit time integration, the performance-enhancing techniques are indeed shown to dramatically reduce the computation time, while allowing the use of many CPU cores and thereby a large amount of memory storage. Moreover, we have developed simple-to-use performance models that support our empirical findings, which are based on using up to 8192 CPU cores and 12.2 terabytes.
