An Implementation of Configurable SIMD Core on FPGA

2013 ◽  
Vol 336-338 ◽  
pp. 1925-1929
Author(s):  
Guang Wang ◽  
Yin Sheng Gao

In order to meet the computing speed required by 4G wireless communications, and to provide the different data-processing widths required by different algorithms, an SIMD (Single Instruction Multiple Data) core has been designed. The ISA (Instruction Set Architecture) and the main components of the SIMD core are discussed, with a focus on how the core can be configured. Finally, the simulation result of multiplying two 8×8 matrices is presented to show the execution of instructions in the proposed SIMD core, and the result verifies the correctness of the design.
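
The abstract includes no code; below is a minimal sketch of the 8×8 matrix-multiplication workload used to verify the core, written with x86 SSE intrinsics purely for illustration (the paper's core uses its own configurable ISA, not SSE).

```c
/* Minimal sketch of the 8x8 matrix-multiply verification workload,
 * expressed with x86 SSE intrinsics for illustration only. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE */

#define N 8

/* C = A * B, all matrices row-major N x N floats. */
static void matmul8x8_simd(float A[N][N], float B[N][N], float C[N][N])
{
    for (int i = 0; i < N; ++i) {
        __m128 acc_lo = _mm_setzero_ps();   /* C[i][0..3] */
        __m128 acc_hi = _mm_setzero_ps();   /* C[i][4..7] */
        for (int k = 0; k < N; ++k) {
            __m128 a = _mm_set1_ps(A[i][k]);   /* broadcast A[i][k] to 4 lanes */
            acc_lo = _mm_add_ps(acc_lo, _mm_mul_ps(a, _mm_loadu_ps(&B[k][0])));
            acc_hi = _mm_add_ps(acc_hi, _mm_mul_ps(a, _mm_loadu_ps(&B[k][4])));
        }
        _mm_storeu_ps(&C[i][0], acc_lo);
        _mm_storeu_ps(&C[i][4], acc_hi);
    }
}

int main(void)
{
    float A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = (float)(i + j);
            B[i][j] = (float)(i == j);   /* identity, so C should equal A */
        }
    matmul8x8_simd(A, B, C);
    printf("C[3][5] = %.1f (expect %.1f)\n", C[3][5], A[3][5]);
    return 0;
}
```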

VLSI Design ◽  
2007 ◽  
Vol 2007 ◽  
pp. 1-7 ◽  
Author(s):  
Zheng Shen ◽  
Hu He ◽  
Yanjun Zhang ◽  
Yihe Sun

This paper describes a novel video-specific instruction set architecture (VS-ISA) for ASIP design. With single instruction multiple data (SIMD) instructions, two destination modes, and video-specific instructions, the instruction set architecture is introduced to enhance performance for video applications. Furthermore, we quantify the improvement on H.263 encoding. In this paper, we evaluate and compare the performance of VS-ISA, other DSPs (digital signal processors), and conventional SIMD media extensions in the context of video coding. Our evaluation results show that VS-ISA improves the processor's performance by approximately 5x on H.263 encoding and outperforms the other architectures by 1.6x to 8.57x in computing the IDCT.
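
As an illustration of the kind of kernel such video-oriented SIMD instructions target, the sketch below computes a 16×16 sum of absolute differences (the dominant operation in H.263 motion estimation) using x86 SSE2; VS-ISA's own instructions are not documented in this abstract, so the choice of intrinsics here is an assumption.

```c
/* 16x16 sum-of-absolute-differences (SAD) kernel, the hot loop of block-based
 * motion estimation, written with the SSE2 PSADBW intrinsic for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 */

static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; ++y) {
        __m128i c = _mm_loadu_si128((const __m128i *)(cur + y * stride));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + y * stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));  /* two 64-bit partial sums */
    }
    /* Add the low and high 64-bit partial sums (both fit easily in 32 bits). */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}

int main(void)
{
    uint8_t cur[16 * 16], ref[16 * 16];
    for (int i = 0; i < 256; ++i) { cur[i] = 100; ref[i] = 103; }
    printf("SAD = %u (expect 768)\n", sad_16x16(cur, ref, 16));
    return 0;
}
```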


2018 ◽  
Vol 232 ◽  
pp. 01046
Author(s):  
Wan Qiao ◽  
Dake Liu

In this paper, we propose a flexible, scalable BP Polar decoding application-specific instruction set processor (PASIP) that supports multiple code lengths (64 to 4096) and arbitrary code rates. High throughput and sufficient programmability are achieved by the single-instruction-multiple-data (SIMD) based architecture and specially designed Polar decoding acceleration instructions. Synthesis results using 65 nm CMOS technology show that the total area of PASIP is 2.71 mm². PASIP provides a maximum throughput of 1563 Mbps (for N = 1024) at a clock frequency of 400 MHz. The comparison with state-of-the-art Polar decoders reveals PASIP's high area efficiency.
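
As a hedged sketch of what such acceleration instructions operate on, the code below shows the elementary min-sum message updates commonly used in BP Polar decoding; the scalar per-lane loop stands in for what a SIMD datapath would compute in one instruction, and all names are illustrative rather than taken from the paper.

```c
/* Min-sum message updates commonly used in belief-propagation Polar decoding.
 * Names and the exact update form are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

/* f(a,b) = sign(a)*sign(b)*min(|a|,|b|)  (min-sum check-node update) */
static float f_minsum(float a, float b)
{
    float s = ((a < 0.0f) != (b < 0.0f)) ? -1.0f : 1.0f;
    return s * fminf(fabsf(a), fabsf(b));
}

/* Combined update of the form g(a,b,c) = b + f(a,c). */
static float g_minsum(float a, float b, float c)
{
    return b + f_minsum(a, c);
}

int main(void)
{
    /* One stage of updates over 4 LLR lanes, done lane-by-lane here; a SIMD
     * datapath would compute all lanes with one instruction. */
    float a[4] = {  1.2f, -0.4f,  2.5f, -3.1f };
    float b[4] = {  0.3f,  1.1f, -0.8f,  2.2f };
    float c[4] = { -0.6f,  0.9f,  1.4f, -0.5f };
    for (int lane = 0; lane < 4; ++lane)
        printf("lane %d: f = %+.2f  g = %+.2f\n",
               lane, f_minsum(a[lane], b[lane]), g_minsum(a[lane], b[lane], c[lane]));
    return 0;
}
```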


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Dau-Chyrh Chang ◽  
Lihong Zhang ◽  
Xiaoling Yang ◽  
Shao-Hsiang Yen ◽  
Wenhua Yu

We introduce a hardware acceleration technique for the parallel finite-difference time-domain (FDTD) method using the SSE (Streaming SIMD (single instruction multiple data) Extensions) instruction set. Applying the SSE instruction set to the parallel FDTD method achieves a significant improvement in simulation performance. Benchmarks of the SSE acceleration on both a multi-CPU workstation and a computer cluster demonstrate the advantages of VALU (vector arithmetic logic unit) acceleration over GPU acceleration. Several engineering applications are employed to demonstrate the performance of the parallel FDTD method enhanced by the SSE instruction set.
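
The sketch below illustrates the style of SSE (VALU) vectorization described, applied to a simple one-dimensional FDTD E-field update; the coefficient and array names are placeholders, not the authors' code.

```c
/* One-dimensional FDTD Ez update vectorized with SSE intrinsics, four cells
 * per operation; illustrative only. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE */

#define NX 1024

static float Ez[NX], Hy[NX];

/* Ez[i] += cb * (Hy[i] - Hy[i-1]) */
static void update_ez_sse(float cb)
{
    __m128 vcb = _mm_set1_ps(cb);
    for (int i = 1; i + 4 <= NX; i += 4) {
        __m128 hy   = _mm_loadu_ps(&Hy[i]);
        __m128 hym1 = _mm_loadu_ps(&Hy[i - 1]);
        __m128 ez   = _mm_loadu_ps(&Ez[i]);
        ez = _mm_add_ps(ez, _mm_mul_ps(vcb, _mm_sub_ps(hy, hym1)));
        _mm_storeu_ps(&Ez[i], ez);
    }
    /* Any remaining cells at the end of the row would be handled scalarly. */
}

int main(void)
{
    for (int i = 0; i < NX; ++i) { Ez[i] = 0.0f; Hy[i] = (float)i; }
    update_ez_sse(0.5f);                 /* Hy[i]-Hy[i-1] == 1, so Ez[i] == 0.5 */
    printf("Ez[10] = %.2f (expect 0.50)\n", Ez[10]);
    return 0;
}
```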


2017 ◽  
Author(s):  
fenglai liu ◽  
Jing Kong

In this work we present an efficient semi-numerical integral implementation specially designed for the Intel Xeon Phi processor to calculate the Hartree-Fock exchange matrix and energy. Compared with an implementation for the conventional CPU platform, a productive implementation on the Phi requires a focus on efficient utilization of the SIMD (single instruction, multiple data) processing units and maximal cache usage. To evaluate the efficiency of the implementation, we performed benchmark calculations on the buckyball molecules C60, C100, C180, and C240. For calculations with the 6-311G(2df) and cc-pVTZ basis sets, the benchmark shows a 7-12x speedup on the Knights Landing Xeon Phi 7250 processor in comparison with a traditional four-center electron repulsion integral calculation performed on a six-core Xeon E5-1650 CPU.
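
The abstract gives no code; the sketch below shows the kind of SIMD-friendly inner contraction over grid points that a semi-numerical exchange build reduces to, expressed with an OpenMP SIMD pragma of the sort that maps onto the Phi's wide vector units. All names (nGrid, phi_mu, interm_nu) are illustrative assumptions, not the authors' implementation.

```c
/* Grid-point contraction of the kind a semi-numerical exchange-matrix build
 * spends its time in; the pragma asks the compiler to vectorize the reduction. */
#include <stdio.h>

/* K_mu_nu += sum_g w[g] * phi_mu[g] * interm_nu[g], vectorized over g. */
static double contract_over_grid(int nGrid, const double *w,
                                 const double *phi_mu, const double *interm_nu)
{
    double k = 0.0;
    /* On Knights Landing this maps onto the 512-bit vector unit; elsewhere
     * the pragma is simply a hint and the code remains correct. */
    #pragma omp simd reduction(+:k)
    for (int g = 0; g < nGrid; ++g)
        k += w[g] * phi_mu[g] * interm_nu[g];
    return k;
}

int main(void)
{
    enum { NG = 1000 };
    double w[NG], phi[NG], interm[NG];
    for (int g = 0; g < NG; ++g) { w[g] = 0.001; phi[g] = 2.0; interm[g] = 3.0; }
    printf("K element = %.3f (expect 6.000)\n",
           contract_over_grid(NG, w, phi, interm));
    return 0;
}
```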


Author(s):  
Dae-Hwan Kim

Thumb-2 is the most recent instruction set architecture for ARM processors, which are among the most widely used embedded processors. In this paper, two extensions are proposed to improve the performance of the Thumb-2 instruction set architecture: addressing mode extensions and sign/zero extensions combined with data processing instructions. To speed up access to an element of aggregated data, the proposed approach first introduces three new addressing modes for load and store instructions: register-plus-immediate offset addressing, negative register offset addressing, and post-increment register offset addressing. Register-plus-immediate offset addressing permits two offsets, and negative register offset addressing allows the offset to be the negated value of a register. Post-increment register offset addressing automatically modifies the offset address after the memory operation. The second extension is sign/zero extension combined with a data processing instruction, which allows the result of a data processing operation to be sign/zero extended to accelerate type conversions. Several of the least frequently used instructions are removed to provide encoding space for the new extensions. Experiments show that the proposed approach improves performance by an average of 8.6% compared to the baseline Thumb-2 instruction set architecture.
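
Below is a C-level illustration of the access patterns the three proposed addressing modes target; the struct, loops, and mapping comments describe the intent of each extension and are assumptions for illustration, not actual Thumb-2 encodings.

```c
/* Source-level patterns that the proposed load/store addressing modes aim
 * to collapse into fewer instructions. Purely illustrative. */
#include <stdio.h>

struct pkt { int hdr; int len; int payload[4]; };

static long sum_fields(struct pkt *p, int n)
{
    long total = 0;
    for (int i = 0; i < n; ++i) {
        /* p[i].len is base + i*sizeof(struct pkt) + field offset:
         * register-plus-immediate offset addressing folds the register-scaled
         * index and the constant field offset into a single load. */
        total += p[i].len;
    }
    /* Walking backwards needs an address that decreases by a register amount;
     * negative register offset addressing covers this case. */
    for (int i = n - 1; i >= 0; --i)
        total += p[i].hdr;
    /* Sequential streaming with a pointer that advances after each access is
     * what post-increment register offset addressing folds into the load. */
    for (const int *q = p[0].payload; q < p[0].payload + 4; ++q)
        total += *q;
    return total;
}

int main(void)
{
    struct pkt a[3] = { {1, 10, {0}}, {2, 20, {0}}, {3, 30, {0}} };
    printf("sum = %ld (expect 66)\n", sum_fields(a, 3));
    return 0;
}
```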


2012 ◽  
Vol 50 (2) ◽  
pp. 122-130 ◽  
Author(s):  
Zukang Shen ◽  
Aris Papasakellariou ◽  
Juan Montojo ◽  
Dirk Gerstenberger ◽  
Fangli Xu
