A high performance FFT library with single instruction multiple data (SIMD) architecture

We introduce a hardware acceleration technique for the parallel finite-difference time-domain (FDTD) method using the SSE (Streaming SIMD Extensions, where SIMD stands for single instruction, multiple data) instruction set. Applying the SSE instruction set to the parallel FDTD method yields a significant improvement in simulation performance. Benchmarks of SSE acceleration on both a multi-CPU workstation and a computer cluster demonstrate the advantages of vector arithmetic logic unit (VALU) acceleration over GPU acceleration. Several engineering applications are used to demonstrate the performance of the parallel FDTD method enhanced by the SSE instruction set.
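
As a rough illustration of the idea (not the authors' implementation), the sketch below applies a 1D FDTD electric-field update first cell by cell and then in groups of four cells, mimicking how one SSE register holds four single-precision values and one instruction updates all of them. The function names, the simplified update formula, and the coefficient are assumptions made for this example; real SSE code would be written with compiler intrinsics or assembly rather than Python.

```python
LANES = 4  # assumption: one SSE register holds four 32-bit floats


def update_e_scalar(e, h, coeff):
    """Reference update: e[i] += coeff * (h[i] - h[i-1]) for i >= 1."""
    out = list(e)
    for i in range(1, len(e)):
        out[i] = e[i] + coeff * (h[i] - h[i - 1])
    return out


def update_e_lanes(e, h, coeff):
    """Same update, processed LANES cells per step to mimic SIMD."""
    out = list(e)
    n = len(e)
    i = 1
    while i + LANES <= n:
        # "Load" three groups of four neighbouring values.
        e_vec = e[i:i + LANES]
        h_vec = h[i:i + LANES]
        h_prev = h[i - 1:i - 1 + LANES]  # shifted (unaligned) load
        # One conceptual vector subtract, multiply, and add over all lanes.
        out[i:i + LANES] = [
            ev + coeff * (hv - hp)
            for ev, hv, hp in zip(e_vec, h_vec, h_prev)
        ]
        i += LANES
    # Scalar tail for cells that do not fill a whole group.
    for j in range(i, n):
        out[j] = e[j] + coeff * (h[j] - h[j - 1])
    return out
```

Both functions perform the same arithmetic per cell, so they return identical results; the grouped version only changes how many cells are handled per step, which is where a SIMD unit gains its throughput.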


A high-performance and low-power 32-bit multiply-accumulate unit with single-instruction-multiple-data (SIMD) feature

IEEE Journal of Solid-State Circuits ◽ 10.1109/jssc.2002.1015692 ◽ 2002 ◽ Vol 37 (7) ◽ pp. 926-931 ◽ Cited By ~ 27

Author(s): Yuyun Liao ◽ D.B. Roberts

Keyword(s): Low Power ◽ High Performance ◽ Single Instruction Multiple Data ◽ Multiple Data


The physical structure of concurrent problems and concurrent computers

Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences ◽ 10.1098/rsta.1988.0096 ◽ 1988 ◽ Vol 326 (1591) ◽ pp. 411-444 ◽ Cited By ~ 11

Keyword(s): High Performance ◽ Parallel Machines ◽ Temporal Structure ◽ Physical Structure ◽ Massively Parallel ◽ Single Instruction Multiple Data ◽ Multiple Data ◽ Network Methods ◽ Particle Process ◽ Physical Analogy

We introduce a physical analogy to describe problems and the high-performance concurrent computers on which they are run. We show that the spatial characteristics of problems lead to their parallelism, and we review the lessons from use of the early hypercubes and a natural particle-process analogy. We generalize this picture to include the temporal structure of problems and show how this allows us to unify distributed, shared and hierarchical memories as well as SIMD (single instruction multiple data) architectures. We also show how neural network methods can be used to analyse a general formalism based on interacting strings, and how these lead to possible real-time schedulers and decomposers for massively parallel machines.


A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Software Practice and Experience ◽ 10.1002/spe.1102 ◽ 2011 ◽ Vol 42 (6) ◽ pp. 753-777 ◽ Cited By ~ 7

Author(s): Hiroshi Inoue ◽ Takao Moriyama ◽ Hideaki Komatsu ◽ Toshio Nakatani

Keyword(s): High Performance ◽ Single Instruction Multiple Data ◽ Sorting Algorithm ◽ Multiple Data


REVERSIBLE SYSTOLIC ARRAYS: m-ARY BIJECTIVE SINGLE-INSTRUCTION MULTIPLE-DATA (SIMD) ARCHITECTURES AND THEIR QUANTUM CIRCUITS

Journal of Circuits System and Computers ◽ 10.1142/s0218126608004472 ◽ 2008 ◽ Vol 17 (04) ◽ pp. 729-771 ◽ Cited By ~ 4

Author(s): ANAS N. AL-RABADI

Keyword(s): High Performance ◽ Cost Effective ◽ Classical Case ◽ Single Instruction Multiple Data ◽ Systolic Arrays ◽ Quantum Superposition ◽ Multiple Data ◽ Wide Range ◽ New Type ◽ Future Technologies

A new type of m-ary systolic array, called the reversible systolic array, is introduced in this paper. The m-ary quantum systolic architectures that realize these arrays, and the computations they perform, are also introduced. A systolic array is an example of a single-instruction multiple-data (SIMD) machine in which each processing element (PE) performs a single simple operation. Systolic devices provide inexpensive but massive computation power; they are cost-effective, high-performance, special-purpose systems with a wide range of applications, such as solving regular, compute-bound problems that involve repetitive operations on large arrays of data. As in the classical case, information in a reversible and quantum systolic circuit flows between cells in a pipelined fashion, and communication with the outside world occurs only at the boundary cells. Since the basic PEs used in the construction of arithmetic systolic arrays are add–multiply cells, the results introduced in this paper are general and apply to a very wide range of add–multiply-based systolic arrays. Because reducing power consumption is a major requirement for circuit design in future technologies, such as quantum computing, reversibility will be one of the main features of several of these technologies. Consequently, the new systolic circuits can play an important role in the design of future circuits that consume minimal power. It is also shown that the new systolic arrays maintain a high level of regularity while exhibiting the new fundamental properties of bijectivity (reversibility) and quantum superposition. These new properties will be essential for performing super-fast, arithmetic-intensive computations that are fundamental to several future applications, such as multi-dimensional quantum signal processing (QSP).
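
To make the classical picture concrete, here is a minimal Python simulation of a linear systolic array built from add–multiply cells, computing a matrix-vector product. It illustrates only the classical, non-reversible case described in the abstract, not the paper's reversible or quantum construction. The names `add_multiply_cell` and `systolic_matvec`, and the choice of a matrix-vector product as the workload, are assumptions made for this sketch.

```python
def add_multiply_cell(y_in, a, x):
    """The single operation every processing element (PE) performs."""
    return y_in + a * x


def systolic_matvec(A, x):
    """Compute A @ x on a simulated linear array of len(x) PEs.

    PE j permanently holds x[j]. The partial sum for row i enters PE 0
    at cycle i, moves one PE to the right per cycle, and leaves the
    array at the last PE, so data enters and exits only at the boundary.
    """
    m, n = len(A), len(x)
    y = [0] * m
    pipe = [None] * n  # pipe[j]: (row, partial sum) PE j produced last cycle
    for t in range(m + n - 1):
        new_pipe = [None] * n
        for j in range(n):
            if j == 0:
                incoming = (t, 0) if t < m else None  # boundary input
            else:
                incoming = pipe[j - 1]  # value handed over by the left PE
            if incoming is not None:
                row, partial = incoming
                new_pipe[j] = (row, add_multiply_cell(partial, A[row][j], x[j]))
        pipe = new_pipe
        if pipe[n - 1] is not None:  # boundary output
            row, total = pipe[n - 1]
            y[row] = total
    return y
```

For example, `systolic_matvec([[1, 2], [3, 4], [5, 6]], [10, 1])` returns `[12, 34, 56]`. Once the pipeline fills, every PE executes the same add-multiply step in each cycle on a different row, which is the regularity the abstract refers to.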


Spatial Reasoning In A Single Instruction/Multiple Data (SIMD) Architecture

10.1117/12.947031 ◽ 1988

Author(s): Joe R. Brown ◽ Steven F. Venable

Keyword(s): Spatial Reasoning ◽ Single Instruction Multiple Data ◽ Multiple Data ◽ Simd Architecture


abPOA: an SIMD-based C library for fast partial order alignment using adaptive band

10.1101/2020.05.07.083196 ◽ 2020

Author(s): Yan Gao ◽ Yongzhuang Liu ◽ Yanmei Ma ◽ Bo Liu ◽ Yadong Wang ◽ ...

Keyword(s): Error Correction ◽ Partial Order ◽ Directed Acyclic Graph ◽ State Of The Art ◽ Single Instruction Multiple Data ◽ Multiple Sequence ◽ Software Interface ◽ Multiple Data ◽ Long Read ◽ Read Error Correction

Summary: Partial order alignment, which aligns a sequence to a directed acyclic graph, is now frequently used as a key component in long-read error correction and assembly. We present abPOA (adaptive banded Partial Order Alignment), a Single Instruction Multiple Data (SIMD) based C library for fast partial order alignment using adaptive banded dynamic programming. It can work as a stand-alone multiple sequence alignment and consensus calling tool or be easily integrated into any long-read error correction and assembly workflow. Compared to a state-of-the-art tool (SPOA), abPOA is up to 15 times faster with a comparable alignment accuracy.

Availability and implementation: abPOA is implemented in C. A stand-alone tool and a C/Python software interface are freely available at https://github.com/yangao07/[email protected] or [email protected]
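
The general idea behind banded dynamic programming can be sketched with a much simpler problem than the one abPOA solves. The Python example below computes the edit distance between two plain sequences while filling only the cells within a fixed distance of the main diagonal. It is an assumption-laden illustration: it uses a fixed rather than adaptive band, aligns sequence to sequence rather than sequence to graph, uses no SIMD, and the function name `banded_edit_distance` is hypothetical and not part of abPOA's interface.

```python
INF = float("inf")


def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells (i, j) with |i - j| <= band.

    Returns None when the final cell lies outside the band. Otherwise the
    result is an upper bound on the true edit distance, and it is exact
    whenever some optimal alignment path stays inside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            # Cells outside the band are treated as unreachable (INF).
            substitute = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, INF) + 1
            insert = cur.get(j - 1, INF) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]
```

Restricting the computation to a band reduces the work from roughly `len(a) * len(b)` cells to roughly `len(a) * (2 * band + 1)` cells, at the cost of possibly missing alignments that wander far from the diagonal.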


SIMD (Single Instruction, Multiple Data) Machines

Encyclopedia of Parallel Computing ◽ 10.1007/978-0-387-09766-4_2440 ◽ 2011 ◽ pp. 1819-1819

Author(s): Jack Dongarra ◽ Piotr Luszczek ◽ Felix Wolf ◽ Jesper Larsson Träff ◽ Patrice Quinton ◽ ...

Keyword(s): Single Instruction Multiple Data ◽ Multiple Data


A radix-2 FFT algorithm for modern single instruction multiple data (SIMD) architectures

IEEE International Conference on Acoustics Speech and Signal Processing ◽ 10.1109/icassp.2002.1005373 ◽ 2002 ◽ Cited By ~ 9

Author(s): Rodriguez

Keyword(s): Single Instruction Multiple Data ◽ Multiple Data


A scalable ASIP for BP Polar decoding with multiple code lengths

MATEC Web of Conferences ◽ 10.1051/matecconf/201823201046 ◽ 2018 ◽ Vol 232 ◽ pp. 01046

Author(s): Wan Qiao ◽ Dake Liu

Keyword(s): Cmos Technology ◽ Single Instruction Multiple Data ◽ Instruction Set ◽ Maximum Throughput ◽ Specific Instruction ◽ Area Efficiency ◽ Multiple Data ◽ High Area ◽ Multiple Code ◽ Application Specific

In this paper, we propose a flexible, scalable BP Polar decoding application-specific instruction set processor (PASIP) that supports multiple code lengths (64 to 4096) and any code rate. High throughput and sufficient programmability are achieved by the single-instruction-multiple-data (SIMD) based architecture and specially designed Polar decoding acceleration instructions. The synthesis result using 65 nm CMOS technology shows that the total area of PASIP is 2.71 mm². PASIP provides a maximum throughput of 1563 Mbps (for N = 1024) at an operating frequency of 400 MHz. The comparison with state-of-the-art Polar decoders reveals PASIP's high area efficiency.
