A high-performance and low-power 32-bit multiply-accumulate unit with single-instruction-multiple-data (SIMD) feature

We introduce a hardware acceleration technique for the parallel finite-difference time-domain (FDTD) method using the SSE (Streaming SIMD Extensions, where SIMD stands for single instruction, multiple data) instruction set. Applying the SSE instruction set to the parallel FDTD method significantly improves simulation performance. Benchmarks of the SSE acceleration on both a multi-CPU workstation and a computer cluster demonstrate the advantages of vector arithmetic logic unit (VALU) acceleration over GPU acceleration. Several engineering applications are used to demonstrate the performance of the parallel FDTD method enhanced by the SSE instruction set.
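The abstract above gives no implementation details, so the following is only an illustrative sketch of why an FDTD field update suits four-wide SIMD execution. It assumes a simplified 1D update rule, and the names `update_e_scalar`, `update_e_lanes`, `coeff` and `LANES` are hypothetical, not taken from the paper. Plain Python lists stand in for SSE registers: each group of `LANES` cells is loaded, updated and stored together, as a single vector instruction would do.

```python
LANES = 4  # assumption: one SSE register holds four 32-bit floats


def update_e_scalar(ez, hy, coeff):
    """Reference 1D update: ez[i] += coeff * (hy[i] - hy[i-1]) for i >= 1."""
    out = list(ez)
    for i in range(1, len(ez)):
        out[i] = ez[i] + coeff * (hy[i] - hy[i - 1])
    return out


def update_e_lanes(ez, hy, coeff):
    """Same update, processed LANES cells at a time to mimic SIMD execution."""
    out = list(ez)
    n = len(ez)
    i = 1
    while i + LANES <= n:
        # "Load" three groups of four neighbouring values.
        e = ez[i:i + LANES]
        h_hi = hy[i:i + LANES]
        h_lo = hy[i - 1:i - 1 + LANES]
        # One subtract, one multiply and one add applied to all lanes.
        out[i:i + LANES] = [e[k] + coeff * (h_hi[k] - h_lo[k])
                            for k in range(LANES)]
        i += LANES
    # Scalar loop for the cells that do not fill a whole group.
    for j in range(i, n):
        out[j] = ez[j] + coeff * (hy[j] - hy[j - 1])
    return out
```

Every cell applies the same arithmetic to neighbouring data with no dependence between cells of the same time step, which is the property a SIMD unit exploits.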

The physical structure of concurrent problems and concurrent computers

Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences ◽

10.1098/rsta.1988.0096 ◽

1988 ◽

Vol 326 (1591) ◽

pp. 411-444 ◽

Cited By ~ 11

Keyword(s):

High Performance ◽

Parallel Machines ◽

Temporal Structure ◽

Physical Structure ◽

Massively Parallel ◽

Single Instruction Multiple Data ◽

Multiple Data ◽

Network Methods ◽

Particle Process ◽

Physical Analogy

We introduce a physical analogy to describe problems and the high-performance concurrent computers on which they are run. We show that the spatial characteristics of problems lead to their parallelism, and we review the lessons from the use of the early hypercubes and a natural particle-process analogy. We generalize this picture to include the temporal structure of problems and show how this allows us to unify distributed, shared and hierarchical memories as well as SIMD (single instruction multiple data) architectures. We also show how neural network methods can be used to analyse a general formalism based on interacting strings, and how these lead to possible real-time schedulers and decomposers for massively parallel machines.

A high-performance sorting algorithm for multicore single-instruction multiple-data processors

Software Practice and Experience ◽

10.1002/spe.1102 ◽

2011 ◽

Vol 42 (6) ◽

pp. 753-777 ◽

Cited By ~ 7

Author(s):

Hiroshi Inoue ◽

Takao Moriyama ◽

Hideaki Komatsu ◽

Toshio Nakatani

Keyword(s):

High Performance ◽

Single Instruction Multiple Data ◽

Sorting Algorithm ◽

Multiple Data

A high performance FFT library with single instruction multiple data (SIMD) architecture

2011 International Conference on Electronics, Communications and Control (ICECC) ◽

10.1109/icecc.2011.6066463 ◽

2011 ◽

Cited By ~ 5

Author(s):

Wang Xu ◽

Zhang Yan ◽

Ding Shunying

Keyword(s):

High Performance ◽

Single Instruction Multiple Data ◽

Multiple Data ◽

Simd Architecture

REVERSIBLE SYSTOLIC ARRAYS: m-ARY BIJECTIVE SINGLE-INSTRUCTION MULTIPLE-DATA (SIMD) ARCHITECTURES AND THEIR QUANTUM CIRCUITS

Journal of Circuits System and Computers ◽

10.1142/s0218126608004472 ◽

2008 ◽

Vol 17 (04) ◽

pp. 729-771 ◽

Cited By ~ 4

Author(s):

ANAS N. AL-RABADI

Keyword(s):

High Performance ◽

Cost Effective ◽

Classical Case ◽

Single Instruction Multiple Data ◽

Systolic Arrays ◽

Quantum Superposition ◽

Multiple Data ◽

Wide Range ◽

New Type ◽

Future Technologies

This paper introduces a new type of m-ary systolic array, called a reversible systolic array, together with its m-ary quantum systolic realizations and computations. A systolic array is an example of a single-instruction multiple-data (SIMD) machine in which each processing element (PE) performs a single simple operation. Systolic devices provide inexpensive but massive computation power; they are cost-effective, high-performance, special-purpose systems with a wide range of applications, such as solving several regular and compute-bound problems that involve repetitive operations on large arrays of data. As in the classical case, information in a reversible and quantum systolic circuit flows between cells in a pipelined fashion, and communication with the outside world occurs only at the boundary cells. Because the basic PEs used to construct arithmetic systolic arrays are add–multiply cells, the results introduced in this paper are general and apply to a very wide range of add–multiply-based systolic arrays. Because reducing power consumption is a major requirement for circuit design in future technologies, such as quantum computing, the main features of several future technologies will include reversibility. Consequently, the new systolic circuits can play an important role in the design of future circuits that consume minimal power. It is also shown that the new systolic arrays maintain a high level of regularity while exhibiting the new fundamental properties of bijectivity (reversibility) and quantum superposition. These properties will be essential for performing the super-fast arithmetic-intensive computations that are fundamental to several future applications, such as multi-dimensional quantum signal processing (QSP).
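The classical behaviour described above (add–multiply cells, pipelined flow between neighbours, input only at the boundary) can be sketched with a small software simulation. This is a minimal model of a conventional output-stationary systolic array for matrix multiplication. It is not the reversible or quantum construction the paper introduces, and the function name `systolic_matmul` and the register layout are assumptions made for illustration.

```python
def systolic_matmul(A, B):
    """Simulate an n x p grid of add-multiply cells computing A (n x m) times B (m x p).

    Rows of A enter at the left boundary and move right one cell per step;
    columns of B enter at the top boundary and move down one cell per step.
    Each cell multiplies the two values it currently holds and adds the
    product to its own accumulator.
    """
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]
    a_reg = [[0] * p for _ in range(n)]  # A value held in each cell
    b_reg = [[0] * p for _ in range(n)]  # B value held in each cell
    for t in range(n + p + m - 2):
        # Shift A values one cell to the right; feed the left boundary.
        for i in range(n):
            for j in range(p - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i  # row i is delayed by i steps (input skew)
            a_reg[i][0] = A[i][k] if 0 <= k < m else 0
        # Shift B values one cell down; feed the top boundary.
        for j in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j  # column j is delayed by j steps (input skew)
            b_reg[0][j] = B[k][j] if 0 <= k < m else 0
        # Every cell performs the same add-multiply operation (SIMD style).
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

At step t, cell (i, j) holds A[i][k] and B[k][j] with k = t - i - j, so after n + p + m - 2 steps each accumulator contains the full dot product for its output element.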

Design of a Low Power, High Performance BICMOS Current-limiting Circuit for DC-DC Converter Application

PIERS Online ◽

10.2529/piers060817034009 ◽

2007 ◽

Vol 3 (4) ◽

pp. 368-373 ◽

Cited By ~ 5

Author(s):

Hongbo Ma ◽

Quanyuan Feng

Keyword(s):

Low Power ◽

High Performance ◽

Current Limiting

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already have on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful on-chip multi-level cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving high performance and low power consumption together is a highly complex challenge, for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of the properties of NUCA (Non-Uniform Cache Architectures). The key points are decoupling the functionality and using three specialized on-chip networks. This structure has proved efficient for data hierarchies, achieving good performance and reducing energy consumption. Instruction caches, on the other hand, have different requirements and characteristics from data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. To do so, we need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.

High-Performance and Low-Power Full Color Reflective LCD for New Applications

Proceedings of the International Display Workshops ◽

10.36463/idw.2019.1411 ◽

2019 ◽

pp. 1411

Author(s):

Hiroyuki Hakoi ◽

Ming Ni ◽

Junichi Hashimoto ◽

Takashi Sato ◽

Shinji Shimada ◽

...

Keyword(s):

Low Power ◽

High Performance ◽

Full Color ◽

New Applications

Performance Analysis of Various Multipliers Using 8T-full Adder with 180nm Technology

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽

10.2174/2352096513666200107091932 ◽

2020 ◽

Vol 13 (6) ◽

pp. 864-870

Author(s):

Sai Venkatramana Prasada G.S ◽

G. Seshikala ◽

S. Niranjana

Keyword(s):

Low Power ◽

Power Dissipation ◽

High Speed ◽

High Performance ◽

Full Adder ◽

Fundamental Operation ◽

Wallace Tree ◽

Power Delay Product ◽

The Comparative Study ◽

Wallace Tree Multiplier

Background: This paper presents a comparative study of the power dissipation, delay and power delay product (PDP) of different full adder and multiplier designs. Methods: The full adder performs the fundamental operation in any processor, DSP architecture or VLSI system. Ten different full adder structures were analyzed for their performance using a Mentor Graphics tool with 180nm technology. Results: From the analysis results, a high-performance full adder was identified for use in higher-level designs. The 8T full adder exhibits high speed, low power and a low power delay product, and was therefore used to construct four different multiplier designs: the Array multiplier, Baugh-Wooley multiplier, Braun multiplier and Wallace Tree multiplier. These multiplier structures were designed using the 8T full adder and simulated using the Mentor Graphics tool with a constant W/L aspect ratio. Conclusion: The analysis shows that the Wallace Tree multiplier is the fastest multiplier but dissipates comparatively high power. The Baugh-Wooley multiplier dissipates less power but exhibits a longer delay and a low PDP.
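To make the relationship between the full adder and the multipliers concrete, here is a minimal logic-level sketch of an unsigned array multiplier built only from full-adder cells and AND gates. It models Boolean behaviour only and says nothing about the 8T transistor circuit, power or delay studied in the paper. The names `full_adder` and `array_multiply` and the default 4-bit width are assumptions for illustration.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out) for input bits a, b, cin."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def array_multiply(x, y, bits=4):
    """Multiply two unsigned `bits`-wide integers using only full adders.

    Each row j forms the partial product x * y[j] with AND gates and adds
    it into the running result through a chain of full-adder cells.
    """
    xb = [(x >> i) & 1 for i in range(bits)]
    yb = [(y >> i) & 1 for i in range(bits)]
    acc = [0] * (2 * bits)  # result bits, least significant first
    for j in range(bits):
        carry = 0
        for i in range(bits):
            pp = xb[i] & yb[j]  # partial-product bit from an AND gate
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
        acc[j + bits] = carry  # carry out of the row becomes the next bit
    return sum(bit << k for k, bit in enumerate(acc))
```

For example, `array_multiply(13, 11)` returns 143. A Wallace Tree multiplier uses the same full-adder cell but arranges the partial-product reduction as a tree rather than row by row, which is the structural source of the speed difference the abstract reports.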

A reconfigurable low-power high-performance matrix multiplier architecture with borrow parallel counters

Proceedings International Parallel and Distributed Processing Symposium ◽

10.1109/ipdps.2003.1213336 ◽

2004 ◽

Author(s):

Rong Lin

Keyword(s):

Low Power ◽

High Performance ◽

Performance Matrix
