A High-Performance Parallel FDTD Method Enhanced by Using SSE Instruction Set

2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Dau-Chyrh Chang ◽  
Lihong Zhang ◽  
Xiaoling Yang ◽  
Shao-Hsiang Yen ◽  
Wenhua Yu

We introduce a hardware acceleration technique for the parallel finite difference time domain (FDTD) method using the SSE (streaming SIMD (single instruction multiple data) extensions) instruction set. Applying the SSE instruction set to the parallel FDTD method yields a significant improvement in simulation performance. Benchmarks of the SSE acceleration on both a multi-CPU workstation and a computer cluster demonstrate the advantages of VALU (vector arithmetic logic unit) acceleration over GPU acceleration. Several engineering applications are employed to demonstrate the performance of the parallel FDTD method enhanced by the SSE instruction set.
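As an illustration of the technique, the sketch below (not taken from the paper) shows how an SSE-vectorized FDTD field update can look in C. The 1D update form and the array names ex, hy, cb are assumptions; the paper targets the full 3D parallel FDTD method.

```c
/* Minimal sketch of an SSE-accelerated FDTD E-field update (illustrative;
 * the 1D form and array names are assumptions, not the paper's code). */
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar reference: ex[i] += cb[i] * (hy[i] - hy[i-1]) */
void update_ex_sse(float *ex, const float *hy, const float *cb, int n)
{
    int i = 1;
    /* VALU path: four single-precision cells per SSE instruction. */
    for (; i + 3 < n; i += 4) {
        __m128 h1 = _mm_loadu_ps(&hy[i]);      /* hy[i..i+3]   */
        __m128 h0 = _mm_loadu_ps(&hy[i - 1]);  /* hy[i-1..i+2] */
        __m128 c  = _mm_loadu_ps(&cb[i]);
        __m128 e  = _mm_loadu_ps(&ex[i]);
        e = _mm_add_ps(e, _mm_mul_ps(c, _mm_sub_ps(h1, h0)));
        _mm_storeu_ps(&ex[i], e);
    }
    /* Scalar tail for the remaining cells. */
    for (; i < n; i++)
        ex[i] += cb[i] * (hy[i] - hy[i - 1]);
}
```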

2021 ◽  
Vol 2086 (1) ◽  
pp. 012166
Author(s):  
D A Savelyev

The diffraction of vortex laser beams with circular polarization by ring gratings of variable height is investigated in this paper. Diffraction in the near zone is modelled numerically with the finite difference time domain (FDTD) method. Changes in the length of the light needle and in the focal spot size are shown to depend on the type of ring grating.


Electronics ◽  
2018 ◽  
Vol 7 (9) ◽  
pp. 180 ◽  
Author(s):  
Javier Acevedo ◽  
Robert Scheffel ◽  
Simon Wunderlich ◽  
Mattis Hasler ◽  
Sreekrishna Pandi ◽  
...  

Random linear network coding (RLNC) can greatly aid data transmission in lossy wireless networks. However, RLNC requires computationally complex matrix multiplications and inversions in finite fields (Galois fields). These computations are highly demanding for energy-constrained mobile devices. The presented case study evaluates hardware acceleration strategies for RLNC in the context of the Tensilica Xtensa LX5 processor with the Tensilica instruction extension (TIE). More specifically, we develop TIEs for multiply-accumulate (MAC) operations for accelerating matrix multiplications in Galois fields, single instruction multiple data (SIMD) instructions operating on consecutive memory locations, as well as the flexible-length instruction extension (FLIX). We evaluate the number of clock cycles required for RLNC encoding and decoding without and with the MAC, SIMD, and FLIX acceleration strategies. We also evaluate the RLNC encoding and decoding throughput and energy consumption for a range of RLNC generation and code word sizes. We find that for GF(2^8) and GF(2^16) RLNC encoding, the SIMD and FLIX acceleration strategies achieve speedups of approximately four hundred fold compared to a benchmark C code implementation without TIE. We also find that the single-core Xtensa LX5 with SIMD has seven to thirty times higher RLNC encoding and decoding throughput than the state-of-the-art ODROID XU3 system-on-a-chip (SoC) operating with a single core; the Xtensa LX5 with FLIX, in turn, increases the throughput by roughly 25% compared to utilizing only SIMD. Furthermore, the Xtensa LX5 with FLIX consumes roughly four orders of magnitude less energy than the ODROID XU3 SoC.
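For readers unfamiliar with Galois-field arithmetic, the sketch below gives a plain scalar C version of the multiply-accumulate that dominates RLNC encoding. The reduction polynomial (0x11D) and the function names are assumptions; the study maps this kind of loop onto custom TIE/SIMD hardware rather than scalar code.

```c
/* Minimal sketch of a GF(2^8) multiply-accumulate as used in RLNC encoding.
 * The field polynomial 0x11D and names are assumptions, not the paper's. */
#include <stdint.h>
#include <stddef.h>

static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;              /* addition in GF(2^m) is XOR */
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry)
            a ^= 0x1D;           /* reduce modulo x^8 + x^4 + x^3 + x^2 + 1 */
        b >>= 1;
    }
    return p;
}

/* One coded symbol: dot product of a coefficient row with source symbols. */
uint8_t rlnc_encode_symbol(const uint8_t *coeff, const uint8_t *src,
                           size_t gen_size)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < gen_size; i++)
        acc ^= gf256_mul(coeff[i], src[i]);   /* MAC in GF(2^8) */
    return acc;
}
```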


2018 ◽  
Vol 232 ◽  
pp. 01046
Author(s):  
Wan Qiao ◽  
Dake Liu

In this paper, we propose a flexible, scalable belief propagation (BP) Polar decoding application-specific instruction set processor (PASIP) that supports multiple code lengths (64 to 4096) and arbitrary code rates. High throughput and sufficient programmability are achieved through the single-instruction-multiple-data (SIMD) based architecture and specially designed Polar decoding acceleration instructions. Synthesis results using 65 nm CMOS technology show that the total area of PASIP is 2.71 mm². PASIP provides a maximum throughput of 1563 Mbps (for N = 1024) at a clock frequency of 400 MHz. A comparison with state-of-the-art Polar decoders reveals PASIP's high area efficiency.
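The min-sum approximation shown below is a common core operation in BP polar decoding and illustrates why the update is SIMD-friendly; it is only a hedged sketch, since the processor's custom acceleration instructions and message schedule are not reproduced here.

```c
/* Minimal sketch of the min-sum "f" operation used in BP polar decoding.
 * Function names and the vectorized loop are illustrative assumptions. */
#include <math.h>
#include <stddef.h>

/* f(a, b) = sign(a) * sign(b) * min(|a|, |b|)  (min-sum approximation) */
static float f_minsum(float a, float b)
{
    float m = fminf(fabsf(a), fabsf(b));
    return ((a < 0.0f) != (b < 0.0f)) ? -m : m;
}

/* One stage of f-updates over an array of LLR pairs; each element is
 * independent, which is what a SIMD datapath exploits. */
void bp_f_stage(const float *la, const float *lb, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = f_minsum(la[i], lb[i]);
}
```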


VLSI Design ◽  
2007 ◽  
Vol 2007 ◽  
pp. 1-7 ◽  
Author(s):  
Zheng Shen ◽  
Hu He ◽  
Yanjun Zhang ◽  
Yihe Sun

This paper describes a novel video-specific instruction set architecture (VS-ISA) for ASIP design. With single instruction multiple data (SIMD) instructions, two destination modes, and video-specific instructions, the instruction set architecture enhances performance for video applications. Furthermore, we quantify the improvement on H.263 encoding. In this paper, we evaluate and compare the performance of VS-ISA, other DSPs (digital signal processors), and conventional SIMD media extensions in the context of video coding. Our evaluation results show that VS-ISA improves the processor's performance by approximately 5x on H.263 encoding and outperforms the other architectures by 1.6x to 8.57x in computing the IDCT.
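As a hedged illustration of the kind of kernel such video-specific instructions target, the sketch below shows a plain C sum-of-absolute-differences (SAD) used for block matching in H.263 motion estimation; the 16x16 block size and the function name are assumptions, not part of VS-ISA.

```c
/* Minimal sketch of a 16x16 SAD kernel; video-specific SIMD instructions
 * typically collapse the inner loop into one or a few instructions. */
#include <stdlib.h>

unsigned sad_16x16(const unsigned char *cur, const unsigned char *ref,
                   int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs((int)cur[x] - (int)ref[x]);
        cur += stride;   /* advance one row in each frame buffer */
        ref += stride;
    }
    return sad;
}
```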


2013 ◽  
Vol 336-338 ◽  
pp. 1925-1929
Author(s):  
Guang Wang ◽  
Yin Sheng Gao

To meet the computing speed required by 4G wireless communications, and to provide the different data-processing widths required by different algorithms, an SIMD (single instruction multiple data) core has been designed. The ISA (instruction set architecture) and main components of the SIMD core are discussed, with a focus on how the SIMD core can be configured. Finally, the simulation result of the multiplication of two 8×8 matrices is presented to show the execution of instructions in the proposed SIMD core; the result verifies the correctness of the SIMD core design.
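A plain scalar reference for the 8×8 matrix multiplication used as the verification test is sketched below; the element types and row-major layout are assumptions, since the abstract does not specify them, and the SIMD core would compute the independent output elements in parallel lanes.

```c
/* Minimal scalar reference for the 8x8 matrix multiply verification case.
 * int16 inputs with int32 accumulation are assumptions. */
#include <stdint.h>

void matmul_8x8(const int16_t a[8][8], const int16_t b[8][8], int32_t c[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 8; k++)
                acc += (int32_t)a[i][k] * b[k][j];   /* MAC per element */
            c[i][j] = acc;
        }
}
```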


We introduce a physical analogy to describe problems and the high-performance concurrent computers on which they are run. We show that the spatial characteristics of problems lead to their parallelism, and we review the lessons from the use of the early hypercubes and a natural particle-process analogy. We generalize this picture to include the temporal structure of problems and show how this allows us to unify distributed, shared, and hierarchical memories as well as SIMD (single instruction multiple data) architectures. We also show how neural network methods can be used to analyse a general formalism based on interacting strings, and these methods lead to possible real-time schedulers and decomposers for massively parallel machines.

