GP-SIMD Processing-in-Memory

This paper presents the architectural features and performance of an Integrated Memory Array Processor (IMAP) LSI, which integrates a large-capacity memory and a one-dimensional SIMD processor array on a single chip. The IMAP has a conventional memory interface, almost the same as that of a dual-port video RAM with an operational input extension. SIMD processing is carried out on the IMAP chip by the internal processor array, while other, higher-level processing is performed concurrently by external processors through the random-access memory port. In addition to the basic IMAP architecture, this paper describes the orthogonal IMAP, which extends that architecture. The basic IMAP uses a conventional memory cell, while the orthogonal IMAP uses an orthogonal memory for holding images.
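
As a rough illustration of what a one-dimensional SIMD processor array does with an image held in on-chip memory, the sketch below simulates it in plain Python. It is not the IMAP design: the function name `simd_threshold`, the one-processing-element-per-column mapping, and the choice of thresholding as the operation are all assumptions made for the example.

```python
def simd_threshold(image, threshold):
    """Simulate a 1D processing-element (PE) array thresholding an image.

    Assumption (not from the abstract): PE number `pe` owns image column `pe`,
    and rows are visited one after another.
    """
    width = len(image[0])
    result = []
    for row in image:  # rows are processed sequentially
        # In each step every PE executes the same compare on its own pixel.
        result.append([1 if row[pe] >= threshold else 0 for pe in range(width)])
    return result


binary = simd_threshold([[1, 5], [7, 2]], 4)
```

The inner list comprehension stands in for the lock-step execution of a single instruction across all PEs; on real hardware that step would take one instruction time regardless of the image width.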

Regional image correspondence matching method for SIMD processing

10.1109/ecctd.2009.5275105 ◽

2009 ◽

Cited By ~ 2

Author(s):

Olli Lahdenoja ◽

Mika Laiho

Keyword(s):

Matching Method ◽

Image Correspondence ◽

Correspondence Matching ◽

SIMD Processing

Polyhedral Object Recognition with Sparse Data in SIMD Processing Mode

10.5244/c.2.16 ◽

1988 ◽

Cited By ~ 1

Author(s):

D. Holder ◽

H. Buxton

Keyword(s):

Object Recognition ◽

Sparse Data ◽

Processing Mode ◽

SIMD Processing

A new simulator workbench for comparing SIMD processing element architectures

Proceedings of the 30th Annual Southeast Regional Conference (ACM-SE 30) ◽

10.1145/503720.503767 ◽

1992 ◽

Cited By ~ 1

Author(s):

Todd C. Marek

Keyword(s):

Processing Element ◽

SIMD Processing

Cell Processing for Two Scientific Computing Kernels

Handbook of Research on Scalable Computing Technologies ◽

10.4018/978-1-60566-661-7.ch014 ◽

2010 ◽

pp. 312-336

Author(s):

Meilian Xu ◽

Parimala Thulasiraman ◽

Ruppa K. Thulasiram

Keyword(s):

High Speed ◽

Scientific Computing ◽

Building Blocks ◽

Data Locality ◽

Data Mapping ◽

Single Chip ◽

Data Intensive ◽

Synchronization Overhead ◽

SIMD Processing ◽

On Chip

This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for a heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitation of current parallel systems that use single-core processors as building blocks. This limitation degrades the performance of applications that have data-intensive and computation-intensive kernels such as Finite Difference Time Domain (FDTD) and Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under a synchronization constraint. FFT based on an indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, thereby reducing the communication and synchronization overhead. The authors hope to unleash the Cell/B.E. and design parallel FDTD and parallel FFT based on ISN by taking into account unique features of the Cell/B.E., such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.
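
The nearest-neighbour communication pattern and the synchronization constraint attributed to FDTD above can be illustrated with a minimal one-dimensional sketch in plain Python. This is not the chapter's Cell/B.E. implementation: the function name `fdtd_1d_step`, the update coefficient `coeff`, the staggered-grid layout, and the fixed boundary values are assumptions chosen for illustration.

```python
def fdtd_1d_step(e, h, coeff=0.5):
    """Advance a 1D staggered-grid field pair by one time step (illustrative).

    `e` holds n samples and `h` holds n - 1 samples located between them.
    """
    n = len(e)
    # Phase 1: each H sample needs only its two neighbouring E samples.
    new_h = [h[i] + coeff * (e[i + 1] - e[i]) for i in range(n - 1)]
    # Synchronization point: every H update must finish before any E update
    # starts, because the E update reads the new H values.
    new_e = list(e)
    # Phase 2: each interior E sample needs only its two neighbouring H samples.
    for i in range(1, n - 1):
        new_e[i] = e[i] + coeff * (new_h[i] - new_h[i - 1])
    return new_e, new_h  # boundary E samples are left unchanged


e_next, h_next = fdtd_1d_step([0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

If the grid were split across several processing units, only the samples at the edges of each partition would have to be exchanged, and the barrier between the two phases is where the synchronization overhead would arise.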

A Scheme for SIMD Processing in Two Dimensional Binary Images and Its Applications

Lecture Notes in Computer Science - Optical SuperComputing ◽

10.1007/978-3-642-10442-8_12 ◽

2009 ◽

pp. 95-98

Author(s):

Kouichi Nitta ◽

Osamu Matoba

Keyword(s):

Two Dimensional ◽

Binary Images ◽

SIMD Processing

Analysis of Relationship Between SIMD-Processing Features Used in NVIDIA GPUs and NEC SX-Aurora TSUBASA Vector Processors

Lecture Notes in Computer Science - Parallel Computing Technologies ◽

10.1007/978-3-030-25636-4_10 ◽

2019 ◽

pp. 125-139 ◽

Cited By ~ 3

Author(s):

Ilya V. Afanasyev ◽

Vadim V. Voevodin ◽

Vladimir V. Voevodin ◽

Kazuhiko Komatsu ◽

Hiroaki Kobayashi

Keyword(s):

SIMD Processing ◽

Vector Processors

A CMOS vision chip with SIMD processing element array for 1 ms image processing

1999 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition (Cat. No.99CH36278) ◽

10.1109/isscc.1999.759195 ◽

2003 ◽

Cited By ~ 32

Author(s):

M. Ishikawa ◽

K. Ogawa ◽

T. Komuro ◽

I. Ishii

Keyword(s):

Image Processing ◽

Processing Element ◽

Vision Chip ◽

SIMD Processing ◽

Element Array

Multi-media extensions in super-pipelined micro-architectures. A new case for SIMD processing?

Proceedings Fifth IEEE International Workshop on Computer Architectures for Machine Perception ◽

10.1109/camp.2000.875984 ◽

2002 ◽

Cited By ~ 5

Author(s):

M. Ferretti

Keyword(s):

SIMD Processing ◽

Multi Media

A Fast CT Reconstruction Scheme for a General Multi-Core PC

International Journal of Biomedical Imaging ◽

10.1155/2007/29160 ◽

2007 ◽

Vol 2007 ◽

pp. 1-9 ◽

Cited By ~ 16

Author(s):

Kai Zeng ◽

Erwei Bai ◽

Ge Wang

Keyword(s):

Data Exchange ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

CT Reconstruction ◽

Reconstruction Process ◽

Multiple Data ◽

SIMD Processing ◽

Specialized Hardware ◽

Graphics Processing

High computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA), which we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and inflexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images at reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be improved several-fold using the latest quad-core processors.
