Reducing the Delay for Decoding Instructions by Predicting Their Source Register Operands

Electronics ◽  
2020 ◽  
Vol 9 (5) ◽  
pp. 820
Author(s):  
Sanghyun Park ◽  
Jaeyung Jun ◽  
Changhyun Kim ◽  
Gyeong Il Min ◽  
Hun Jae Lee ◽  
...  

Fetched instructions may have data dependences on in-flight instructions in a processor's pipeline, and these dependences prevent the processor from executing the incoming instructions so that the program's correctness is guaranteed. Register and memory dependences are detected in the decode and memory stages, respectively. In a small embedded processor that supports as many ISAs as possible to reduce code size, decoding an instruction to identify its register usage and perform the dependence check generally incurs a long delay and can become a critical path in the implementation. To reduce this delay, this paper proposes two methods. The first method assumes the most widely used source register operand bit-fields without fully decoding the instructions; however, this assumption causes additional stalls on incorrect predictions and thus degrades performance. To solve this problem, the second method adopts a table-based scheme that stores the dependence history and later uses this information to predict the dependency more precisely. We applied our methods to the commercial EISC embedded processor with the Samsung 65 nm process; as a result, we reduced the critical path delay, increased the maximum operating frequency by 12.5%, and achieved an average 11.4% speed-up in the execution time of the EEMBC applications. We also improved the static power, dynamic power, and EDP by 7.2%, 8.5%, and 13.6%, respectively, despite an implementation area overhead of 2.5%.
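
To make the two proposed methods concrete, the Python sketch below models them at a high level: a quick guess of the source register fields taken from fixed bit positions without a full decode, and a small history table that is consulted before that guess is trusted. The bit positions, field names, and table size are illustrative assumptions and do not reflect the actual EISC instruction encoding or the paper's table organization.

# Illustrative model of the two methods; bit positions, field widths, and
# table size are hypothetical, not the real EISC encoding.

PRED_TABLE_SIZE = 256                     # entries in the dependence-history table
pred_table = [True] * PRED_TABLE_SIZE     # True = trust the quick field guess

def guess_source_regs(instr_word):
    """Method 1: guess the source register operands from commonly used
    bit-fields without fully decoding the instruction."""
    rs1 = (instr_word >> 4) & 0xF         # assumed common rs1 field
    rs2 = (instr_word >> 8) & 0xF         # assumed common rs2 field
    return rs1, rs2

def predict_dependency(pc, instr_word, inflight_dests):
    """Method 2: consult the history table before using the quick guess;
    returns None when the slow, exact decode-and-check should be used."""
    if not pred_table[pc % PRED_TABLE_SIZE]:
        return None
    rs1, rs2 = guess_source_regs(instr_word)
    return rs1 in inflight_dests or rs2 in inflight_dests

def update_history(pc, guess_was_correct):
    """Record whether the quick guess matched the real decode result."""
    pred_table[pc % PRED_TABLE_SIZE] = guess_was_correct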

Author(s):  
Kenta Shirane ◽  
Takahiro Yamamoto ◽  
Hiroyuki Tomiyama

In this paper, we present a case study on approximate multipliers for an MNIST convolutional neural network (CNN). We apply approximate multipliers of different bit-widths to the convolution layer of the MNIST CNN, evaluate the MNIST classification accuracy, and analyze the trade-off among the approximate multipliers' area, critical path delay, and the resulting accuracy. Based on the results of this evaluation and analysis, we propose a design methodology for approximate multipliers. The approximate multipliers consist of a subset of partial products, which are carefully selected according to the CNN input. With this methodology, we further reduce the area and delay of the multipliers while keeping the MNIST classification accuracy high.
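
As a rough illustration of the bit-width trade-off described above, the Python sketch below truncates both operands before multiplying and uses that approximate product inside a small 2-D convolution. It assumes unsigned 8-bit quantized pixels and weights; the paper's multipliers instead select individual partial products according to the CNN input, which this toy model does not reproduce.

import numpy as np

def approx_mult(a, b, keep_bits=8):
    # Toy approximate multiplier: keep only the 'keep_bits' most significant
    # bits of each unsigned 8-bit operand, a stand-in for dropping low-order
    # partial products. keep_bits=8 gives the exact product.
    shift = max(0, 8 - keep_bits)
    return ((a >> shift) * (b >> shift)) << (2 * shift)

def approx_conv2d(image, kernel, keep_bits=8):
    # 2-D convolution (no padding) that accumulates approximate products.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1), dtype=np.int64)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = 0
            for u in range(kH):
                for v in range(kW):
                    acc += approx_mult(int(image[i + u, j + v]),
                                       int(kernel[u, v]), keep_bits)
            out[i, j] = acc
    return out

Sweeping keep_bits while re-evaluating the MNIST test accuracy yields the kind of accuracy versus area/delay trade-off curve that the study analyzes.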


2018 ◽  
Vol 7 (2.16) ◽  
pp. 94
Author(s):  
Abhishek Choubey ◽  
SPV Subbarao ◽  
Shruti B. Choubey

Multiplication is one of the most essential arithmetic operations in digital signal processing and communications. These applications need transforms, convolutions, and dot products that involve an enormous number of multiplications of an operand with a constant; typical examples include wavelets and digital filters such as FIR or IIR. However, multiplier structures have a relatively large area-delay product, long latency, and significantly higher power consumption than other arithmetic structures. Therefore, low-power multiplier design has always been a significant part of DSP-oriented VLSI design. The Booth multiplier is among the most efficient multipliers, as it considerably reduces the number of partial products. In this paper, we propose a Booth multiplier that uses seamless pipelining. Theoretical comparison shows that the proposed Booth multiplier requires less critical path delay than the traditional Booth multiplier. ASIC simulation results show that the proposed radix-16 Booth multiplier achieves 13% less critical path delay for word width n=16 and 17% less critical path delay for word width n=32 than the best existing radix-16 Booth multiplier.
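
The complexity reduction of the Booth multiplier comes from recoding the multiplier's bits into signed digits so that fewer partial products are generated. The Python sketch below shows standard radix-4 Booth recoding and a multiplication built on it; the paper's radix-16 design follows the same idea with 5-bit overlapping groups and digits in [-8, +8], and its pipelining scheme is not modeled here.

def booth_radix4_digits(x, n_bits):
    # Recode an n-bit two's-complement multiplier into radix-4 Booth digits
    # in {-2, -1, 0, +1, +2} by scanning overlapping 3-bit groups.
    m = (x & ((1 << n_bits) - 1)) << 1      # append a 0 below the LSB
    digits = []
    for i in range(0, n_bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        b2 = (m >> (i + 2)) & 1
        digits.append(b1 + b0 - 2 * b2)     # digit = -2*b2 + b1 + b0
    return digits

def booth_multiply(a, b, n_bits=16):
    # One shifted partial product per digit: n/2 partial products instead of
    # the n required by a plain array multiplier.
    result = 0
    for k, d in enumerate(booth_radix4_digits(b, n_bits)):
        result += (d * a) << (2 * k)
    return result

# Example: booth_multiply(1234, -567, 16) == 1234 * -567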


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 630
Author(s):  
Padmanabhan Balasubramanian ◽  
Raunaq Nayar ◽  
Douglas L. Maskell

This article describes the design of approximate array multipliers obtained by making vertical or horizontal cuts in an accurate array multiplier, followed by different input and output assignments within the multiplier. We consider a digital image denoising application and show how different combinations of input and output assignments in an approximate array multiplier affect the quality of the denoised images. We consider the accurate array multiplier and several approximate array multipliers for synthesis. The multipliers were described in Verilog hardware description language and synthesized with Synopsys Design Compiler using a 32/28-nm complementary metal-oxide-semiconductor technology. The results show that, compared to the accurate array multiplier, one of the proposed approximate array multipliers, viz. PAAM01-V7, achieves a 28% reduction in critical path delay, a 75.8% reduction in power, and a 64.6% reduction in area while producing a denoised image comparable in quality to the image denoised using the accurate array multiplier. The standard design metrics of the accurate and approximate multipliers, such as critical path delay, total power dissipation, and area, are given, along with the error parameters of the approximate array multipliers; the original image, the noisy image, and the denoised images are also shown for comparison.
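
A minimal software model of the vertical-cut idea is sketched below: partial products in the columns to the right of a chosen cut position are simply not computed and are replaced by a constant bit, which corresponds to one possible output assignment in the truncated region. The cut position and fill value here are arbitrary illustrations; the exact input and output assignments of PAAM01-V7 and the other proposed multipliers are not reproduced.

def approx_array_multiply(a, b, n_bits=8, cut=7, fill=0):
    # Unsigned n x n array multiplier with a vertical cut: partial products
    # whose column weight is below 'cut' are replaced by the constant 'fill'
    # instead of being generated by an AND gate. cut=0 gives the exact product.
    result = 0
    for i in range(n_bits):                # bit i of multiplier b
        for j in range(n_bits):            # bit j of multiplicand a
            col = i + j                    # column weight 2**col
            if col < cut:
                pp = fill
            else:
                pp = ((a >> j) & 1) & ((b >> i) & 1)
            result += pp << col
    return result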


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Ali Asghar ◽  
Muhammad Mazher Iqbal ◽  
Waqar Ahmed ◽  
Mujahid Ali ◽  
Husain Parvez ◽  
...  

In modern SRAM-based Field Programmable Gate Arrays, a Look-Up Table (LUT) is the principal constituent logic element, able to realize any Boolean function of its inputs. However, this flexibility of LUTs comes with a heavy area penalty. Part of this area overhead comes from the configuration memory, which grows exponentially with the LUT size. In this paper, we first present a detailed analysis of a previously proposed FPGA architecture which allows sharing of LUT memory (SRAM) tables among NPN-equivalent functions, to reduce the area as well as the number of configuration bits. We then propose several methods to improve the existing architecture. A new clustering technique has been proposed which packs NPN-equivalent functions together inside a Configurable Logic Block (CLB). We also make use of a recently proposed high-performance Boolean matching algorithm to perform NPN classification. To enhance area savings further, we evaluate the feasibility of more than two LUTs sharing the same SRAM table. Consequently, this work explores the SRAM table sharing approach for a range of LUT sizes (4–7), while varying the cluster sizes (4–16). Experimental results on the MCNC benchmark circuit set show an overall area reduction of ~7% while maintaining the same critical path delay.
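
Two LUT functions can share one SRAM table precisely when they belong to the same NPN class, i.e., when one can be obtained from the other by permuting inputs, negating inputs, and/or negating the output. The exhaustive Python sketch below computes a canonical NPN representative for a small truth table; the architecture work relies on a much faster Boolean matching algorithm, so this brute-force version only illustrates the classification step.

from itertools import permutations

def npn_canonical(tt, k=4):
    # Return the smallest truth table reachable from 'tt' (an integer with
    # 2**k bits) under input permutation, input negation, and output negation.
    n_rows = 1 << k
    best = None
    for perm in permutations(range(k)):        # input permutation
        for neg in range(1 << k):              # input negation mask
            for out_neg in (0, 1):             # output negation
                t = 0
                for row in range(n_rows):
                    src = 0
                    for i in range(k):
                        b = ((row >> i) & 1) ^ ((neg >> i) & 1)
                        src |= b << perm[i]
                    val = (tt >> src) & 1
                    t |= (val ^ out_neg) << row
                if best is None or t < best:
                    best = t
    return best

# Functions f and g can share an SRAM table iff npn_canonical(f) == npn_canonical(g).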


2008 ◽  
Vol 24 (1-3) ◽  
pp. 203-222 ◽  
Author(s):  
Kyriakos Christou ◽  
Maria K. Michael ◽  
Spyros Tragoudas

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Mohd Tausif ◽  
Ekram Khan ◽  
Mohd Hasan ◽  
Martin Reisslein

This paper proposes and evaluates the LFrWF, a novel lifting-based architecture that computes the discrete wavelet transform (DWT) of images using the fractional wavelet filter (FrWF). In order to reduce the memory requirement of the proposed architecture, only one image line is read into a buffer at a time. Aside from an LFrWF version with multipliers, the LFrWFm, we develop a multiplier-less version, the LFrWFml, which reduces the critical path delay (CPD) to the delay Ta of an adder. The proposed LFrWFm and LFrWFml architectures are compared in terms of the required adders, multipliers, memory, and critical path delay with state-of-the-art DWT architectures. Moreover, the proposed LFrWFm and LFrWFml architectures, along with the state-of-the-art FrWF architectures (with multipliers, FrWFm, and without multipliers, FrWFml), are compared through implementation on the same FPGA board. The LFrWFm requires 22% fewer look-up tables (LUTs), 34% fewer flip-flops (FFs), and 50% fewer compute cycles (CCs), and consumes 65% less energy than the FrWFm. The proposed LFrWFml architecture requires 50% fewer CCs and consumes 43% less energy than the FrWFml. Thus, the proposed LFrWFm and LFrWFml architectures appear suitable for computing the DWT of images on wearable sensors.
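
For readers unfamiliar with lifting, the Python sketch below applies one level of the reversible LeGall 5/3 lifting steps to a single image line using only additions and shifts, which is the property a multiplier-less datapath can exploit. It illustrates the lifting scheme in general, with symmetric boundary extension, and is not a model of the FrWF/LFrWF filter or line-buffer organization described in the paper.

def lift_5_3_line(line):
    # One-level 1-D lifting DWT of an image line (list of ints, length >= 2)
    # with the integer LeGall 5/3 wavelet: a predict step producing detail
    # coefficients and an update step producing approximation coefficients.
    even = list(line[0::2])                # approximation candidates
    odd = list(line[1::2])                 # detail candidates
    # Predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    for i in range(len(odd)):
        right = even[i + 1] if i + 1 < len(even) else even[i]
        odd[i] -= (even[i] + right) >> 1
    # Update: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4)
    for i in range(len(even)):
        left = odd[i - 1] if i > 0 else odd[0]
        right = odd[i] if i < len(odd) else odd[-1]
        even[i] += (left + right + 2) >> 2
    return even, odd                       # (approximation, detail)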

