loop unrolling Latest Research Papers

Optimization of the Convolution Operation to Accelerate Deep Neural Networks in FPGA

In recent years, convolutional neural networks (CNNs) have been widely used for image-related machine learning, where they achieve high recognition accuracy. Because a CNN involves a large number of computations, hardware accelerators such as field-programmable gate arrays (FPGAs) are employed. About 90% of the operations in a CNN are convolutions. The objective of this work is to reduce computation time as the height, width, and pixel intensity levels of the input image increase. Most of the execution time of an image processing program is spent in loops. Loop optimization increases speed and reduces the overheads associated with loops; it plays a crucial role in improving performance and in making effective use of multiprocessing capabilities. Loop unrolling is one such technique. In this work, a CNN with four levels of loop unrolling is used. As a result, delay is reduced compared with a conventional Xilinx implementation. With the help of strides and padding, computation time is reduced by 40%, as verified in MATLAB.
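To make the idea in the abstract above concrete, here is a minimal sketch of a 1D convolution with stride and zero padding, first as a plain loop and then with the inner multiply-accumulate loop unrolled by a factor of 4. The function names, the 1D setting, and the unroll factor are assumptions for illustration; the paper's "four levels" may instead refer to four nested loop levels, and its FPGA design is not reproduced here.

```python
def pad_zeros(signal, padding):
    """Zero-pad a sequence on both sides."""
    return [0] * padding + list(signal) + [0] * padding


def conv1d_reference(signal, kernel, stride=1, padding=0):
    """Plain CNN-style convolution (no kernel flip)."""
    padded = pad_zeros(signal, padding)
    out_len = (len(padded) - len(kernel)) // stride + 1
    out = []
    for o in range(out_len):
        base = o * stride
        acc = 0
        for k in range(len(kernel)):
            acc += padded[base + k] * kernel[k]
        out.append(acc)
    return out


def conv1d_unrolled4(signal, kernel, stride=1, padding=0):
    """Same computation with the inner loop unrolled by 4 (assumed factor)."""
    padded = pad_zeros(signal, padding)
    klen = len(kernel)
    main = klen - klen % 4  # largest multiple of 4 not exceeding klen
    out_len = (len(padded) - klen) // stride + 1
    out = []
    for o in range(out_len):
        base = o * stride
        acc = 0
        k = 0
        while k < main:
            # Four multiply-accumulates per loop test instead of one.
            acc += padded[base + k] * kernel[k]
            acc += padded[base + k + 1] * kernel[k + 1]
            acc += padded[base + k + 2] * kernel[k + 2]
            acc += padded[base + k + 3] * kernel[k + 3]
            k += 4
        while k < klen:
            # Remainder loop for kernel lengths that are not multiples of 4.
            acc += padded[base + k] * kernel[k]
            k += 1
        out.append(acc)
    return out
```

In interpreted Python the unrolled form mainly shows the shape of the transformation; the benefit the abstract describes comes from hardware or compiled code, where the four multiply-accumulates can be scheduled in parallel. A larger stride shrinks the number of output positions, which is one way the total work drops.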


On the Transformation Optimization for Stencil Computation

Electronics ◽

10.3390/electronics11010038 ◽

2021 ◽

Vol 11 (1) ◽

pp. 38

Author(s):

Huayou Su ◽

Kaifang Zhang ◽

Songzhu Mei

Keyword(s):

Load Balance ◽

Loop Transformation ◽

Redundancy Elimination ◽

Stencil Computation ◽

Loop Unrolling ◽

Loop Fusion ◽

Potential Benefits ◽

Successful Employment ◽

2D And 3D

Stencil computation optimizations have been investigated extensively, and various approaches have been proposed. Loop transformation is a vital kind of optimization in modern production compilers, where it has been employed successfully. In this paper, we combine the two aspects to study the potential benefits that some common transformation recipes may have for stencils. The recipes consist of loop unrolling, loop fusion, address precalculation, redundancy elimination, instruction reordering, load balance, and a forward and backward update algorithm named semi-stencil. Experimental evaluations of diverse stencil kernels, including 1D, 2D, and 3D computation patterns, on two typical ARM and Intel platforms demonstrate the respective effects of the transformation recipes. For the single transformation recipes we analyze, an average speedup of 1.65× is obtained, with a best case of 1.88×. The compound recipes demonstrate a maximum speedup of 1.92×.
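Two of the recipes named above, loop unrolling and redundancy elimination, can be sketched on a 3-point 1D stencil. This is an illustrative sketch only: the function names, the unroll factor of 2, and the choice to copy boundary values unchanged are assumptions, not the paper's kernels.

```python
def stencil3_reference(a, c):
    """3-point stencil over interior points; boundaries are copied as-is."""
    n = len(a)
    out = list(a)
    for i in range(1, n - 1):
        out[i] = c[0] * a[i - 1] + c[1] * a[i] + c[2] * a[i + 1]
    return out


def stencil3_unrolled2(a, c):
    """Same stencil, unrolled by 2, reusing already-loaded neighbours."""
    n = len(a)
    out = list(a)
    if n < 3:
        return out  # no interior points
    c0, c1, c2 = c
    left, mid = a[0], a[1]
    i = 1
    while i + 1 < n - 1:
        # Two interior points per iteration. Only two new values are
        # loaded; `left` and `mid` carry over from the previous step.
        right = a[i + 1]
        nxt = a[i + 2]
        out[i] = c0 * left + c1 * mid + c2 * right
        out[i + 1] = c0 * mid + c1 * right + c2 * nxt
        left, mid = right, nxt
        i += 2
    if i < n - 1:
        # One leftover interior point when the interior count is odd.
        out[i] = c0 * left + c1 * mid + c2 * a[i + 1]
    return out
```

The unrolled version reads each input element once instead of up to three times, which is the kind of redundant load a compiler or a hand-written recipe would aim to remove.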


Adaptive Data-Transition Decision Feedback Equalizers For High-Speed Serial Links

10.32920/17303675 ◽

2021 ◽

Author(s):

Yue Li

Keyword(s):

High Speed ◽

Low Frequency ◽

Clock And Data Recovery ◽

Frequency Difference ◽

Data Recovery ◽

Decision Feedback ◽

Loop Unrolling ◽

Frequency Components ◽

Eye Opening ◽

Simulation Results

This dissertation investigates adaptive decision feedback equalizers for high-speed serial data links.

An adaptive data-transition decision feedback equalizer (DT-DFE) was developed. The DT-DFE boosts the eye-opening of the high-frequency components of data without attenuating their low-frequency counterparts. Reference voltages were obtained by transmitting consecutive 1s and 0s and measuring the output of the continuous-time linear equalizer using a pair of successive approximation register analog-to-digital converters in a training phase. It uses loop unrolling to detect data transitions, activate tap-tuning, launch DFE, and combat timing constraints. The performance of the DT-DFE and its advantages over commonly used data-state DFE were validated using the schematic-level simulation results of 5 Gbps backplane links.

A new adaptive DT-DFE with edge-emphasis (EE) taps and raised references was developed. Loop-unrolling was further developed for DT-DFE with EE-taps. The reference voltages were raised beyond that set by the low-frequency components of data to increase vertical eye-opening. Clock and data recovery was performed using 4x oversampling. The DT-DFE was validated using the schematic-level simulation results of 10 Gbps backplane links.

A pre-skewed bi-directional gated delay line (BDGDL) bang-bang frequency difference-to-digital converter and a BDGDL integrating frequency difference-to-digital converter (iFDDC) were proposed for clock and data recovery. Both frequency difference detectors feature all-digital realization, low power consumption, and high-speed operation. The built-in integration of iFDDC results in a zero static frequency error and the first-order noise-shaping of the quantization errors of the BDGDL and digitally-controlled oscillators. Their effectiveness was validated using schematic-level simulation results of 5-GHz frequency-locked loops.

All systems validating the proposed adaptive DFE and frequency-difference detectors were designed in TSMC’s 65 nm CMOS technology and analyzed using Spectre from Cadence Design Systems.
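In equalizer design, "loop unrolling" usually means speculation: both possible outcomes of the feedback path are computed ahead of time, and the previous decision only selects between them. The behavioural sketch below shows this for a generic one-tap DFE with ±1 symbols. It is not the dissertation's DT-DFE; the function names, the tap weight `h1`, and the initial previous symbol of -1 are assumptions for illustration.

```python
def dfe_direct(samples, h1):
    """One-tap DFE with the feedback subtraction inside the decision loop."""
    decisions = []
    prev = -1  # assumed initial previous symbol
    for x in samples:
        d = 1 if x - h1 * prev > 0 else -1
        decisions.append(d)
        prev = d
    return decisions


def dfe_unrolled(samples, h1):
    """Loop-unrolled (speculative) form of the same one-tap DFE."""
    decisions = []
    prev = -1  # assumed initial previous symbol
    for x in samples:
        # Both candidate decisions depend only on the current sample,
        # so in hardware two comparators can resolve them in parallel.
        d_if_prev_high = 1 if x - h1 > 0 else -1
        d_if_prev_low = 1 if x + h1 > 0 else -1
        # The previous decision is now just a multiplexer select.
        d = d_if_prev_high if prev == 1 else d_if_prev_low
        decisions.append(d)
        prev = d
    return decisions
```

Both functions produce the same decisions. The difference is what sits in the timing-critical loop: a subtract-and-compare in the direct form, but only a 2:1 selection in the unrolled form, which is why the technique helps with timing constraints at multi-Gbps rates.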


Adaptive Data-Transition Decision Feedback Equalizers For High-Speed Serial Links

10.32920/17303675.v1 ◽

2021 ◽

Author(s):

Yue Li

Keyword(s):

High Speed ◽

Low Frequency ◽

Clock And Data Recovery ◽

Frequency Difference ◽

Data Recovery ◽

Decision Feedback ◽

Loop Unrolling ◽

Frequency Components ◽

Eye Opening ◽

Simulation Results

This dissertation investigates adaptive decision feedback equalizers for high-speed serial data links.

An adaptive data-transition decision feedback equalizer (DT-DFE) was developed. The DT-DFE boosts the eye-opening of the high-frequency components of data without attenuating their low-frequency counterparts. Reference voltages were obtained by transmitting consecutive 1s and 0s and measuring the output of the continuous-time linear equalizer using a pair of successive approximation register analog-to-digital converters in a training phase. It uses loop unrolling to detect data transitions, activate tap-tuning, launch DFE, and combat timing constraints. The performance of the DT-DFE and its advantages over commonly used data-state DFE were validated using the schematic-level simulation results of 5 Gbps backplane links.

A new adaptive DT-DFE with edge-emphasis (EE) taps and raised references was developed. Loop-unrolling was further developed for DT-DFE with EE-taps. The reference voltages were raised beyond that set by the low-frequency components of data to increase vertical eye-opening. Clock and data recovery was performed using 4x oversampling. The DT-DFE was validated using the schematic-level simulation results of 10 Gbps backplane links.

A pre-skewed bi-directional gated delay line (BDGDL) bang-bang frequency difference-to-digital converter and a BDGDL integrating frequency difference-to-digital converter (iFDDC) were proposed for clock and data recovery. Both frequency difference detectors feature all-digital realization, low power consumption, and high-speed operation. The built-in integration of iFDDC results in a zero static frequency error and the first-order noise-shaping of the quantization errors of the BDGDL and digitally-controlled oscillators. Their effectiveness was validated using schematic-level simulation results of 5-GHz frequency-locked loops.

All systems validating the proposed adaptive DFE and frequency-difference detectors were designed in TSMC’s 65 nm CMOS technology and analyzed using Spectre from Cadence Design Systems.


Studying the impacts of loop unrolling and pipeline in the FPGA design of the Simon and RoadRunneR lightweight ciphers

2021 10th International Conference on Modern Circuits and Systems Technologies (MOCAST) ◽

10.1109/mocast52088.2021.9493376 ◽

2021 ◽

Author(s):

G. Georgiou ◽

G. Theodoridis

Keyword(s):

Fpga Design ◽

Loop Unrolling


The Effect of Loop Unrolling in Energy Efficient Strassen's Algorithm on Shared Memory Architecture

2021 36th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC) ◽

10.1109/itc-cscc52171.2021.9501472 ◽

2021 ◽

Author(s):

Nwe Zin Oo ◽

Panyayot Chaikan

Keyword(s):

Shared Memory ◽

Energy Efficient ◽

Memory Architecture ◽

Loop Unrolling ◽

Strassen's Algorithm


High Level Synthesis Optimizations of Road Lane Detection Development on Zynq-7000

Pertanika Journal of Science and Technology ◽

10.47836/pjst.29.2.01 ◽

2021 ◽

Vol 29 (2) ◽

Author(s):

Panadda Solod ◽

Nattha Jindapetch ◽

Kiattisak Sengchuai ◽

Apidet Booranawong ◽

Pakpoom Hoyingcharoen ◽

...

Keyword(s):

Low Cost ◽

Optimization Techniques ◽

Lane Detection ◽

High Level Synthesis ◽

Resource Usage ◽

Clock Frequency ◽

Loop Analysis ◽

Loop Unrolling ◽

Loop Pipelining ◽

High Level

In this work, we propose High-Level Synthesis (HLS) optimization processes to improve the speed and resource usage of complex algorithms, especially those with nested loops. The proposed process has four steps: array sizing, to decrease resource usage on the Programmable Logic (PL) part; loop analysis, to determine which loops should be unrolled or pipelined; array partitioning, to resolve the bottlenecks of loop unrolling and loop pipelining; and HLS interface selection, to choose the best block-level and port-level interfaces for the array arguments of the RTL design. A road lane detection case study was analyzed and implemented with suitable optimization techniques on the Xilinx Zynq-7000 family (Zybo ZC7010-1), a low-cost FPGA. In the experiments, the proposed method runs 6.66 times faster than the primitive method at a clock frequency of 100 MHz, reaching about 6 FPS. Although the proposed methods cannot reach standard real-time performance (25 FPS), they can guide HLS developers in increasing speed and decreasing resource usage on an FPGA.
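The link between array partitioning and loop unrolling in the steps above can be sketched in software. A loop unrolled by 4 wants four array reads per cycle, but a single on-chip memory offers only a limited number of ports; splitting the array cyclically into four banks gives each unrolled lane its own memory. The Python below is a behavioural model only, with hypothetical function names and an assumed factor of 4; a real HLS flow would express this with tool directives on C/C++ code, which are not shown here.

```python
def cyclic_partition(data, factor):
    """Split data into `factor` banks: element i goes to bank i % factor."""
    return [data[b::factor] for b in range(factor)]


def accumulate_partitioned(data):
    """Sum an array with a 4-way unrolled loop over 4 cyclic banks."""
    factor = 4  # assumed unroll and partition factor
    bank0, bank1, bank2, bank3 = cyclic_partition(data, factor)
    rows = len(data) // factor  # iterations where all four banks have data
    s0 = s1 = s2 = s3 = 0
    for r in range(rows):
        # Each lane reads a different bank, so the four reads do not
        # compete for the same memory port.
        s0 += bank0[r]
        s1 += bank1[r]
        s2 += bank2[r]
        s3 += bank3[r]
    total = s0 + s1 + s2 + s3
    # Leftover elements when len(data) is not a multiple of 4.
    for bank in (bank0, bank1, bank2, bank3):
        if len(bank) > rows:
            total += bank[rows]
    return total
```

Using four independent partial sums also removes the dependency between consecutive additions, which is what lets a pipelined or unrolled datapath keep all lanes busy.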


Optimization Techniques on NEC SX-Aurora Vector Accelerators (Técnicas de otimização em Aceleradores Vetoriais NEC SX-Aurora)

10.5753/eradrs.2021.14792 ◽

2021 ◽

Author(s):

Félix Michels ◽

Matheus Serpa ◽

Danilo Carastan-Santos ◽

Lucas Schnorr ◽

Phillipe Navaux

Keyword(s):

Loop Unrolling

This work evaluates the use of classical optimization techniques on the new NEC SX-Aurora architecture. The NAS benchmark and a real seismic migration application used by the oil and gas industry serve as case studies. The final experimental results show that the loop unrolling and inlining optimization techniques improve performance, measured in FLOPS, by up to 7.8× on the NAS benchmark and by up to 1.9× on the real seismic migration application, compared with the performance of the original versions.


Improving performance for simulating complex fluids on massively parallel computers by component loop-unrolling and communication hiding

2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) ◽

10.1109/hpcc-smartcity-dss50907.2020.00017 ◽

2020 ◽

Author(s):

Xiao-Wei Guo ◽

Chao Li ◽

Wei Li ◽

Yu Cao ◽

Yi Liu ◽

...

Keyword(s):

Complex Fluids ◽

Parallel Computers ◽

Massively Parallel ◽

Loop Unrolling ◽

Massively Parallel Computers


Efficient graphic processing unit implementation of the chemical-potential multiphase lattice Boltzmann method

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020968272 ◽

2020 ◽

Vol 35 (1) ◽

pp. 78-96

Author(s):

Yutong Ye ◽

Hongyin Zhu ◽

Chaoying Zhang ◽

Binghai Wen

Keyword(s):

Lattice Boltzmann Method ◽

Lattice Boltzmann ◽

Chemical Potential ◽

Graphic Processing Unit ◽

Density Ratio ◽

Processing Unit ◽

Central Difference ◽

Loop Unrolling ◽

Boltzmann Method ◽

Graphic Processing

The chemical-potential multiphase lattice Boltzmann method (CP-LBM) satisfies thermodynamic consistency and Galilean invariance, realizes a very large density ratio, and easily expresses surface wettability. Unlike the traditional central difference scheme, the CP-LBM uses the Thomas algorithm to calculate the differences in multiphase simulations, which significantly improves calculation accuracy but increases calculation complexity. In this study, we designed and implemented a parallel algorithm for the chemical-potential model on a graphic processing unit (GPU). Several strategies were used to optimize the GPU algorithm, such as coalesced access, instruction throughput, thread organization, memory access, and loop unrolling. Compared with a dual-Xeon 5117 CPU server, our methods achieved a 95-fold speedup on an NVIDIA RTX 2080Ti GPU and a 106-fold speedup on an NVIDIA Tesla P100 GPU. When the algorithm was extended to an environment with dual NVIDIA Tesla P100 GPUs, a 189-fold speedup was achieved and the workload of each GPU reached 96%.
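For readers unfamiliar with the Thomas algorithm mentioned above, the following is a minimal serial sketch of the standard tridiagonal solver. How the paper builds its tridiagonal system from the difference scheme, and how it maps the solve onto GPU threads, is not stated in the abstract, so the function name and argument layout here are assumptions. No pivoting is done, so the sketch assumes a well-conditioned (for example, diagonally dominant) system.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system A x = rhs.

    lower[i] is A[i][i-1] (lower[0] is unused), diag[i] is A[i][i],
    upper[i] is A[i][i+1] (upper[-1] is unused). All have length n.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    # Forward sweep: eliminate the sub-diagonal.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        if i < n - 1:
            c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

Each sweep carries a dependency from one row to the next, so a single solve is inherently sequential; that is consistent with the abstract's remark that the approach raises calculation complexity relative to a plain central difference, which needs only neighbouring values.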


loop unrolling
Recently Published Documents

Optimization of the Convolution Operation to Accelerate Deep Neural Networks in FPGA

On the Transformation Optimization for Stencil Computation

Adaptive Data-Transition Decision Feedback Equalizers For High-Speed Serial Links

Adaptive Data-Transition Decision Feedback Equalizers For High-Speed Serial Links

Studying the impacts of loop unrolling and pipeline in the FPGA design of the Simon and RoadRunneR lightweight ciphers

The Effect of Loop Unrolling in Energy Efficient Strassen's Algorithm on Shared Memory Architecture

High Level Synthesis Optimizations of Road Lane Detection Development on Zynq-7000

Optimization Techniques on NEC SX-Aurora Vector Accelerators (Técnicas de otimização em Aceleradores Vetoriais NEC SX-Aurora)

Improving performance for simulating complex fluids on massively parallel computers by component loop-unrolling and communication hiding

Efficient graphic processing unit implementation of the chemical-potential multiphase lattice Boltzmann method
