loop unrolling
Recently Published Documents


TOTAL DOCUMENTS

108
(FIVE YEARS 22)

H-INDEX

10
(FIVE YEARS 1)

2021 ◽  
Vol 35 (6) ◽  
pp. 511-517
Author(s):  
Malathi Devendran ◽  
Indumathi Rajendran ◽  
Vijayakumar Ponnusamy ◽  
Diwakar R. Marur

In recent years, machine learning algorithms related to images have been widely utilized by Convolution Neural Networks (CNN), and it has a high accuracy for recognition of an image. As CNN contains large number of computations, hardware accelerator like Field Programmable Gate Array is employed. Quite 90 % of operations during a CNN involves convolution. The objective of this work is to scale back the computation time to increase the peak, width and the pixel intensity levels in the input image. The execution time of a image processing program is mostly spent on loops. Loop optimization is a process of accelerating speed and reducing the overheads related to loops. It plays a crucial role in improving performance and making effective use of multiprocessing capabilities. Loop unrolling is one of the loop optimization techniques. In our work CNN with four levels of loop unrolling is used. Due to this delay is reduced compared with conventional Xilinix. With the assistance of strides and padding the 40 % of computation time has been reduced and is verified in MATLAB.


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 38
Author(s):  
Huayou Su ◽  
Kaifang Zhang ◽  
Songzhu Mei

Stencil computation optimizations have been investigated quite a lot, and various approaches have been proposed. Loop transformation is a vital kind of optimization in modern production compilers and has proved successful employment within compilers. In this paper, we combine the two aspects to study the potential benefits some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balance, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms, demonstrate the respective effects of the transformation recipes. An average speedup of 1.65× is obtained, and the best is 1.88× for the single transformation recipes we analyze. The compound recipes demonstrate a maximum speedup of 1.92×.


2021 ◽  
Author(s):  
Yue Li

This dissertation investigates adaptive decision feedback equalizers for high-speed serial data links.<div>An adaptive data-transition decision feedback equalizer (DT-DFE) was developed. The DT-DFE boosts the eye-opening of the high-frequency components of data without attenuating their low-frequency counterparts. Reference voltages were obtained by transmitting consecutive 1s and 0s and measuring the output of the continuous-time linear equalizer using a pair of successive approximation register analog-to-digital converters in a training phase. It uses loop unrolling to detect data transitions, activate tap-tuning, launch DFE, and combat timing constraints. The performance of the DT-DFE and its advantages over commonly used data-state DFE were validated using the schematic-level simulation results of 5 Gbps backplane links.<br></div><div>A new adaptive DT-DFE with edge-emphasis (EE) taps and raised references was developed. Loop-unrolling was further developed for DT-DFE with EE-taps. The reference voltages were raised beyond that set by the low-frequency components of data to increase vertical eye-opening. Clock and data recovery was performed using 4x oversampling. The DT-DFE was validated using the schematiclevel simulation results of 10 Gbps backplane links.<br></div><div>A pre-skewed bi-directional gated delay line (BDGDL) bang-bang frequency difference-to-digital converter and a BDGDL integrating frequency difference-todigital converter (iFDDC) were proposed for clock and data recovery. Both frequency difference detectors feature all-digital realization, low power consumption, and high-speed operation. The built-in integration of iFDDC results in a zero static frequency error and the first-order noise-shaping of the quantization errors of the BDGDL and digitally-controlled oscillators. Their effectiveness was validated using schematic-level simulation results of 5-GHz frequency-locked loops.<br></div><div>All systems validating the proposed adaptive DFE and frequency-difference detectors were designed in TSMC’s 65 nm CMOS technology and analyzed using Spectre from Cadence Design Systems. <br></div>


2021 ◽  
Author(s):  
Yue Li

This dissertation investigates adaptive decision feedback equalizers for high-speed serial data links.<div>An adaptive data-transition decision feedback equalizer (DT-DFE) was developed. The DT-DFE boosts the eye-opening of the high-frequency components of data without attenuating their low-frequency counterparts. Reference voltages were obtained by transmitting consecutive 1s and 0s and measuring the output of the continuous-time linear equalizer using a pair of successive approximation register analog-to-digital converters in a training phase. It uses loop unrolling to detect data transitions, activate tap-tuning, launch DFE, and combat timing constraints. The performance of the DT-DFE and its advantages over commonly used data-state DFE were validated using the schematic-level simulation results of 5 Gbps backplane links.<br></div><div>A new adaptive DT-DFE with edge-emphasis (EE) taps and raised references was developed. Loop-unrolling was further developed for DT-DFE with EE-taps. The reference voltages were raised beyond that set by the low-frequency components of data to increase vertical eye-opening. Clock and data recovery was performed using 4x oversampling. The DT-DFE was validated using the schematiclevel simulation results of 10 Gbps backplane links.<br></div><div>A pre-skewed bi-directional gated delay line (BDGDL) bang-bang frequency difference-to-digital converter and a BDGDL integrating frequency difference-todigital converter (iFDDC) were proposed for clock and data recovery. Both frequency difference detectors feature all-digital realization, low power consumption, and high-speed operation. The built-in integration of iFDDC results in a zero static frequency error and the first-order noise-shaping of the quantization errors of the BDGDL and digitally-controlled oscillators. Their effectiveness was validated using schematic-level simulation results of 5-GHz frequency-locked loops.<br></div><div>All systems validating the proposed adaptive DFE and frequency-difference detectors were designed in TSMC’s 65 nm CMOS technology and analyzed using Spectre from Cadence Design Systems. <br></div>


2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Panadda Solod ◽  
Nattha Jindapetch ◽  
Kiattisak Sengchuai ◽  
Apidet Booranawong ◽  
Pakpoom Hoyingcharoen ◽  
...  

In this work, we proposed High-Level Synthesis (HLS) optimization processes to improve the speed and the resource usage of complex algorithms, especially nested-loop. The proposed HLS optimization processes are divided into four steps: array sizing is performed to decrease the resource usage on Programmable Logic (PL) part, loop analysis is performed to determine which loop must be loop unrolling or loop pipelining, array partitioning is performed to resolve the bottleneck of loop unrolling and loop pipelining, and HLS interface is performed to select the best block level and port level interface for array argument of RTL design. A case study road lane detection was analyzed and applied with suitable optimization techniques to implement on the Xilinx Zynq-7000 family (Zybo ZC7010-1) which was a low-cost FPGA. From the experimental results, our proposed method reaches 6.66 times faster than the primitive method at clock frequency 100 MHz or about 6 FPS. Although the proposed methods cannot reach the standard real-time (25 FPS), they can instruct HLS developers for speed increasing and resource decreasing on an FPGA.


2021 ◽  
Author(s):  
Félix Michels ◽  
Matheus Serpa ◽  
Danilo Carastan-Santos ◽  
Lucas Schnorr ◽  
Phillipe Navaux
Keyword(s):  

Avalia-se nesse trabalho a utilização de técnicas de otimização clássicas na nova arquitetura NEC SX-Aurora. Utilizou-se como estudo de caso o benchmark NAS e uma aplicação real de migração sísmica, utilizada pela indústria de petróleo e gás. Os resultados experimentais finais mostram a melhora no desempenho, em FLOPS, utilizando as técnicas de otimização loop unrolling e inlining, no benchmark NAS em até 7, 8× e na aplicação real de migração sísmica em até 1, 9×, em comparação com o desempenho das versões originais.


Author(s):  
Yutong Ye ◽  
Hongyin Zhu ◽  
Chaoying Zhang ◽  
Binghai Wen

The chemical-potential multiphase lattice Boltzmann method (CP-LBM) has the advantages of satisfying the thermodynamic consistency and Galilean invariance, and it realizes a very large density ratio and easily expresses the surface wettability. Compared with the traditional central difference scheme, the CP-LBM uses the Thomas algorithm to calculate the differences in the multiphase simulations, which significantly improves the calculation accuracy but increases the calculation complexity. In this study, we designed and implemented a parallel algorithm for the chemical-potential model on a graphic processing unit (GPU). Several strategies were used to optimize the GPU algorithm, such as coalesced access, instruction throughput, thread organization, memory access, and loop unrolling. Compared with dual-Xeon 5117 CPU server, our methods achieved 95 times speedup on an NVIDIA RTX 2080Ti GPU and 106 times speedup on an NVIDIA Tesla P100 GPU. When the algorithm was extended to the environment with dual NVIDIA Tesla P100 GPUs, 189 times speedup was achieved and the workload of each GPU reached 96%.


Sign in / Sign up

Export Citation Format

Share Document