Optimization Techniques on NEC SX-Aurora Vector Accelerators (Técnicas de otimização em Aceleradores Vetoriais NEC SX-Aurora)

2021 ◽  
Author(s):  
Félix Michels ◽  
Matheus Serpa ◽  
Danilo Carastan-Santos ◽  
Lucas Schnorr ◽  
Phillipe Navaux

This work evaluates the use of classical optimization techniques on the new NEC SX-Aurora architecture. The case studies are the NAS benchmark and a real-world seismic migration application used by the oil and gas industry. The final experimental results show that the loop unrolling and inlining optimizations improve performance, measured in FLOPS, by up to 7.8× on the NAS benchmark and by up to 1.9× on the seismic migration application, compared with the original versions.
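As an illustration of the two optimizations named in this abstract, the following minimal sketch shows a small inlinable helper and a loop manually unrolled by a factor of four. The kernel, the function names, and the unroll factor are assumptions chosen for illustration; they are not taken from the NAS benchmark or from the seismic migration code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Small helper that a compiler can inline into the loop body,
// removing the call overhead (illustrative, not from the paper).
inline double axpy_element(double a, double x, double y) {
    return a * x + y;
}

// Baseline: one element per iteration.
// Assumes x has at least as many elements as y.
void axpy_rolled(double a, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = axpy_element(a, x[i], y[i]);
    }
}

// Manually unrolled by 4: fewer branch and counter updates per element,
// and four independent statements the hardware can overlap.
void axpy_unrolled4(double a, const std::vector<double>& x, std::vector<double>& y) {
    const std::size_t n = y.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     = axpy_element(a, x[i],     y[i]);
        y[i + 1] = axpy_element(a, x[i + 1], y[i + 1]);
        y[i + 2] = axpy_element(a, x[i + 2], y[i + 2]);
        y[i + 3] = axpy_element(a, x[i + 3], y[i + 3]);
    }
    // Remainder loop for sizes that are not a multiple of 4.
    for (; i < n; ++i) {
        y[i] = axpy_element(a, x[i], y[i]);
    }
}
```

Both versions compute the same values element by element; whether the unrolled form is actually faster depends on the compiler and the target architecture.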

2008 ◽  
Vol 43 (7) ◽  
pp. 141-150
Author(s):  
Mounira Bachir ◽  
Sid-Ahmed-Ali Touati ◽  
Albert Cohen

2018 ◽  
Vol 228 ◽  
pp. 03008
Author(s):  
Xuehua Liu ◽  
Liping Ding ◽  
Yanfeng Li ◽  
Guangxuan Chen ◽  
Jin Du

Register pressure is a well-known problem for compilers because of the mismatch between the unlimited number of pseudo registers and the finite number of hard registers. Excessive register pressure may result in register spilling, which in turn degrades performance. Many compiler optimizations, especially loop optimizations, suffer from register spilling. To reduce register pressure and thereby improve the effectiveness of the compiler, this research takes register pressure into account during the loop unrolling transformation. In addition, a register-pressure-aware transformation can reduce the performance overhead of some fine-grained randomization transformations that are used to defend against ROP attacks. Experiments showed a peak improvement of about 3.6% and an average improvement of about 1% for the SPEC CPU 2006 benchmarks, and a peak improvement of about 3% and an average improvement of about 1% for the LINPACK benchmark.
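The idea of limiting the unroll factor by the available hard registers can be sketched as follows. This is a deliberately simplified, hypothetical cost model, not the heuristic from the paper: `LoopInfo`, `choose_unroll_factor`, and the assumption that each unrolled copy needs a fixed number of extra live values are all introduced here for illustration.

```cpp
#include <cassert>

// Hypothetical summary of a loop's register needs (illustrative only).
struct LoopInfo {
    unsigned loop_invariant_regs;  // values live across every unrolled copy
    unsigned regs_per_copy;        // extra live values added by each copy
};

// Pick the largest power-of-two unroll factor whose estimated register
// demand still fits in the hard registers, so that unrolling does not
// force spills. Returns 1 (no unrolling) if even a factor of 2 does not fit.
unsigned choose_unroll_factor(const LoopInfo& loop,
                              unsigned hard_regs,
                              unsigned max_factor) {
    unsigned factor = 1;
    while (factor * 2 <= max_factor &&
           loop.loop_invariant_regs + (factor * 2) * loop.regs_per_copy <= hard_regs) {
        factor *= 2;
    }
    return factor;
}
```

With 4 loop-invariant values and 3 values per copy, a 16-register budget allows a factor of 4 (4 + 4·3 = 16), while an 8-register budget leaves the loop rolled.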


2012 ◽  
Vol 10 (4) ◽  
Author(s):  
J Progsch ◽  
Y Ineichen ◽  
A Adelmann

Vector operations play an important role in high performance computing and are typically provided by highly optimized libraries that implement the Basic Linear Algebra Subprograms (BLAS) interface. In C++, templates and operator overloading allow these vector operations to be implemented as expression templates, which construct custom loops at compile time and provide a more abstract interface. Unfortunately, existing expression template libraries lack the performance of fast BLAS implementations. This paper presents a new approach, Statically Accelerated Loop Templates (SALT), to close this performance gap by combining expression templates with an aggressive loop unrolling technique. Benchmarks were conducted using the Intel C++ compiler and the GNU Compiler Collection to assess the performance of our library relative to Intel's Math Kernel Library as well as the Eigen template library. The results show that the approach is able to provide optimization comparable to the fastest available BLAS implementations, while retaining the convenience and flexibility of a template library.
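To make the combination of expression templates and compile-time unrolling concrete, here is a minimal sketch. It is not the SALT library's implementation or API; the `sketch` namespace and the `Vec`, `AddExpr`, and `Unroll` names are hypothetical, and only elementwise addition of fixed-size vectors is covered.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

namespace sketch {

// Expression node for elementwise lhs + rhs. It stores references and
// computes each element lazily, so no temporary vector is created.
// The node must not outlive the full expression it appears in.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Recursive template that expands into N assignment statements at
// compile time, i.e. a fully unrolled loop.
template <std::size_t I, std::size_t N>
struct Unroll {
    template <typename D, typename E>
    static void run(D& dst, const E& expr) {
        dst[I] = expr[I];
        Unroll<I + 1, N>::run(dst, expr);
    }
};

// Recursion end: nothing left to assign.
template <std::size_t N>
struct Unroll<N, N> {
    template <typename D, typename E>
    static void run(D&, const E&) {}
};

template <std::size_t N>
struct Vec {
    std::array<double, N> data{};
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }

    // Assigning an expression evaluates it element by element, unrolled.
    template <typename E>
    Vec& operator=(const E& expr) {
        Unroll<0, N>::run(data, expr);
        return *this;
    }
};

// Builds an expression node instead of computing a result immediately.
template <typename L, typename R>
AddExpr<L, R> operator+(const L& lhs, const R& rhs) {
    return AddExpr<L, R>{lhs, rhs};
}

}  // namespace sketch
```

A statement such as `r = a + b + c;` then compiles to one pass over the elements with no intermediate vectors. Keeping the operator inside its own namespace is a design choice so that the unconstrained template is only found for these types.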


10.29007/x3tx ◽  
2019 ◽  
Author(s):  
Luka Daoud ◽  
Fady Hussein ◽  
Nader Rafla

The Advanced Encryption Standard (AES) is a fundamental building block of many network security protocols, ensuring data confidentiality in applications ranging from data servers to low-power embedded hardware systems. To optimize such hardware implementations, High-Level Synthesis (HLS) provides flexibility in designing, and rapidly optimizing, dedicated hardware to meet the design constraints. In this paper, we present the implementation of an AES encryption processor on an FPGA using Xilinx Vivado HLS. The AES architecture was analyzed and designed using loop unrolling together with inner-round and outer-round pipelining techniques to achieve a maximum throughput of up to 1290 Mbps (megabits per second) with a very low resource usage of 3.24% of the FPGA's slices, achieving 3 Mbps per slice.
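In HLS code, loop unrolling is usually requested with a pragma rather than written by hand. The sketch below applies the Vivado HLS `unroll` pragma to the AES AddRoundKey step, which XORs the 16-byte state with a round key. The function name and its placement in a design are assumptions, and this is not the authors' implementation; an ordinary C++ compiler ignores the pragma, so the function can still be tested in software.

```cpp
#include <cassert>
#include <cstdint>

// AddRoundKey: XOR each of the 16 state bytes with the round key.
// With the loop fully unrolled, an HLS tool can build 16 parallel XORs
// instead of one XOR reused over 16 cycles (illustrative sketch).
void add_round_key(std::uint8_t state[16], const std::uint8_t round_key[16]) {
    for (int i = 0; i < 16; ++i) {
        #pragma HLS unroll
        state[i] ^= round_key[i];
    }
}
```

Full unrolling trades area for throughput, which is why it is typically combined with pipelining and checked against the slice budget.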


Author(s):  
Wei Zhang ◽  
Xing Cai

Numerically solving time-fractional diffusion equations, especially in three space dimensions, is a daunting computational task. This is due to the huge requirements of both computation time and memory storage. Compared with solving integer-ordered diffusion equations, the costs for time and storage both increase by a factor that equals the number of time steps involved. Aiming to overcome these two obstacles, we study in this paper three programming techniques: loop unrolling, vectorization and parallelization. For a representative numerical scheme that adopts finite differencing and explicit time integration, the performance-enhancing techniques are indeed shown to dramatically reduce the computation time, while allowing the use of many CPU cores and thereby a large amount of memory storage. Moreover, we have developed simple-to-use performance models that support our empirical findings, which are based on using up to 8192 CPU cores and 12.2 terabytes.
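The reason both time and storage grow with the number of time steps is that each new step depends on every earlier time level, not just the previous one. The following sketch shows such a history sum for a single grid point, with the loop over time levels unrolled by two. The function name, the data layout, and the weights are placeholders assumed for illustration; the actual coefficients depend on the discretization, which the abstract does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted sum over all stored time levels for one grid point.
// history[k][cell] is the solution at time level k, and weights[k] is a
// placeholder coefficient (assumed; scheme-dependent in practice).
// Assumes weights has at least as many entries as history.
double history_sum(const std::vector<std::vector<double>>& history,
                   const std::vector<double>& weights,
                   std::size_t cell) {
    const std::size_t steps = history.size();
    double even = 0.0;  // two independent partial sums let the
    double odd = 0.0;   // two multiply-adds per iteration overlap
    std::size_t k = 0;
    for (; k + 2 <= steps; k += 2) {  // unrolled by 2 over time levels
        even += weights[k] * history[k][cell];
        odd  += weights[k + 1] * history[k + 1][cell];
    }
    if (k < steps) {  // leftover level when the count is odd
        even += weights[k] * history[k][cell];
    }
    return even + odd;
}
```

Because every time level must be kept in `history` and visited at each step, the work and the memory per grid point both scale with the number of steps taken so far.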

