Design and implementation of fast and hardware‐efficient parallel processing elements to set full and partial permutations in Beneš networks

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™ architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double-precision floating-point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
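As a rough reading aid, the two throughput figures quoted in the abstract can be put on a common per-blade basis. The short sketch below uses only the numbers stated above (170.7 GFLOPS on one blade, 11.1 TFLOPS on 84 blades); the variable names are illustrative, and the ratio is plain arithmetic on the reported results, not a figure taken from the paper.

```python
# Figures quoted in the abstract above.
single_blade_gflops = 170.7   # one BladeCenter QS22
cluster_tflops = 11.1         # cluster result
cluster_blades = 84           # number of QS22 blades in the cluster

# Average throughput per blade when running as part of the cluster.
per_blade_in_cluster = cluster_tflops * 1000 / cluster_blades

# Per-blade cluster throughput relative to the stand-alone blade result.
relative_throughput = per_blade_in_cluster / single_blade_gflops

print(f"{per_blade_in_cluster:.1f} GFLOPS per blade in the cluster")  # 132.1
print(f"{relative_throughput:.1%} of the single-blade result")        # 77.4%
```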


Design and Implementation of Rearrangable Non-Blocking Switching Network in VLSI

International Journal of Recent Technology and Engineering - 2, DOI 10.35940/ijrte.b1356.0982s1119, 2019, Vol 8 (2S11), pp. 2858-2863

Keyword(s): Interconnection Network, The Other, Switching Network, Low Latency, Multistage Interconnection Network, Design And Implementation, Benes Network, Benes Networks

The main goal of this article is to implement an effective non-blocking Benes switching network. The Benes switching network is built from simple switch modules and has several advantages: low latency, less traffic, and the number of switch modules it requires. Clos and Benes networks play a key role in the class of multistage interconnection networks because of their extensibility and modularity. The Benes network provides low latency compared with the other networks. An 8x8 non-blocking Benes switching network is designed and synthesized using the Xilinx tool, version 12.1.
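To make the idea of an 8x8 rearrangeable Benes network concrete, here is a minimal software sketch of the classical recursive "looping" method for choosing 2x2 switch settings for a given permutation. It is not the VLSI design described in the abstract, and the function names (`route_benes`, `simulate`, `count_switches`) and the nested-dictionary layout are assumptions made for illustration only.

```python
def route_benes(perm):
    """Switch settings realizing perm, where perm[i] is the output of input i.

    len(perm) must be a power of two. 0 = straight, 1 = cross.
    """
    n = len(perm)
    if n == 2:
        return {"switch": int(perm[0] == 1)}
    inv = [0] * n
    for i, o in enumerate(perm):
        inv[o] = i
    color = [None] * n  # 0 = upper subnetwork, 1 = lower subnetwork
    for start in range(0, n, 2):
        i = start
        while color[i] is None:
            color[i] = 0          # route this input through the upper half
            j = i ^ 1             # its input-switch partner must go lower
            color[j] = 1
            i = inv[perm[j] ^ 1]  # input sharing j's output switch goes upper
    half = n // 2
    upper, lower = [0] * half, [0] * half
    for i in range(n):
        (upper if color[i] == 0 else lower)[i // 2] = perm[i] // 2
    return {
        "in": [color[2 * k] for k in range(half)],
        "out": [color[inv[2 * m]] for m in range(half)],
        "upper": route_benes(upper),
        "lower": route_benes(lower),
    }


def simulate(cfg, n):
    """Permutation actually realized by the switch settings in cfg."""
    if n == 2:
        return [1, 0] if cfg["switch"] else [0, 1]
    half = n // 2
    up, low = simulate(cfg["upper"], half), simulate(cfg["lower"], half)
    result = [0] * n
    for i in range(n):
        k = i // 2
        goes_lower = (i % 2) ^ cfg["in"][k]  # straight: even input -> upper
        m = (low if goes_lower else up)[k]
        # Straight output switch: the upper subnetwork feeds output 2m.
        result[i] = 2 * m + (goes_lower ^ cfg["out"][m])
    return result


def count_switches(cfg):
    """Total number of 2x2 switches in the configured network."""
    if "switch" in cfg:
        return 1
    return (len(cfg["in"]) + len(cfg["out"])
            + count_switches(cfg["upper"]) + count_switches(cfg["lower"]))


perm = [3, 7, 0, 4, 1, 6, 2, 5]   # an arbitrary example permutation
cfg = route_benes(perm)
print(simulate(cfg, 8) == perm)   # True
print(count_switches(cfg))        # 20
```

The switch count agrees with the standard formula for an N x N Benes network of 2x2 switches, (N/2)(2 log2 N - 1), which gives 20 switches in 5 stages for N = 8.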


Design And Implementation of Combined Pipelining and Parallel Processing Architecture for FIR and IIR Filters Using VHDL

International Journal of VLSI Design & Communication Systems, DOI 10.5121/vlsic.2019.10401, 2019, Vol 10 (4), pp. 1-16

Author(s): Jacinta Potsangbam, Manoj Kumar

Keyword(s): Parallel Processing, IIR Filters, Design And Implementation, Processing Architecture


Nonlinear dynamics in telecommunication systems: design and implementation of a large array of service processing elements

International Journal of Communication Systems, DOI 10.1002/(sici)1099-1131(199705/06)10:3<147::aid-dac335>3.0.co;2-z, 1997, Vol 10 (3), pp. 147-159, Cited By ~ 1

Author(s): C. T. Pointon, R. A. Carrasco, M. A. Gell

Keyword(s): Nonlinear Dynamics, Systems Design, Large Array, Processing Elements, Design And Implementation, Telecommunication Systems


Design and implementation of a real-time identification system using parallel processing technique for adaptive control of a DC motor drive

Proceedings of IECON '95 - 21st Annual Conference on IEEE Industrial Electronics, DOI 10.1109/iecon.1995.483834, 2002

Author(s): Ming-Fa Tsai, Ying-Yu Tzou

Keyword(s): Adaptive Control, Parallel Processing, Real Time, DC Motor, Processing Technique, Motor Drive, Identification System, Design And Implementation, Real Time Identification


Design and Implementation of a Sunshine Duration Calculation System with Massively Parallel Processing

Big Data Applications and Services 2017 - Advances in Intelligent Systems and Computing, DOI 10.1007/978-981-13-0695-2_11, 2018, pp. 91-97, Cited By ~ 1

Author(s): Woosuk Shin, Nakhoon Baek

Keyword(s): Parallel Processing, Sunshine Duration, Massively Parallel, Design And Implementation, Massively Parallel Processing, Calculation System


Design and Implementation of Parallel Processing in Embedded PDC Application for FACTS Wide-Area Damping Control

Interconnected Power Systems - Power Systems, DOI 10.1007/978-3-662-48627-6_12, 2015, pp. 203-223

Author(s): Yong Li, Dechang Yang, Fang Liu, Yijia Cao, Christian Rehtanz

Keyword(s): Parallel Processing, Wide Area, Design And Implementation, Damping Control


Low latency modular multiplication for public-key cryptosystems using a scalable array of parallel processing elements

2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), DOI 10.1109/mwscas.2013.6674830, 2013, Cited By ~ 3

Author(s): Yinan Kong, Yufeng Lai

Keyword(s): Parallel Processing, Public Key, Low Latency, Modular Multiplication, Processing Elements, Public Key Cryptosystems


Hardware-Abbildung eines videobasierten Verfahrens zur echtzeitfähigen Auswertung von Winkelhistogrammen auf eine modulare Coprozessor-Architektur

Advances in Radio Science, DOI 10.5194/ars-8-135-2010, 2010, Vol 8, pp. 135-142, Cited By ~ 1

Author(s): H. Flatt, A. Tarnowsky, H. Blume, P. Pirsch

Keyword(s): Image Processing, Parallel Processing, Real Time, Video Stream, Image Resolution, Clock Frequency, Processing Elements, RISC Processor, Configurable Architecture, Time Evaluation

Abstract. This paper presents the mapping of a video-based approach for real-time evaluation of angular histograms onto a modular coprocessor architecture. The architecture comprises several dedicated processing elements for parallel processing of computation-intensive image processing tasks and is coupled with a RISC processor. A configurable architecture extension, a processing element for evaluating angular histograms of objects, enables real-time classification in conjunction with the RISC processor. Depending on the configuration, the architecture extension requires between 3,300 and 12,000 look-up tables on a Xilinx Virtex-5 FPGA. At a clock frequency of 100 MHz, and independently of the image resolution, up to 100 objects of size 256×256 pixels can be analyzed per frame in a 25 Hz video stream.
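The throughput figures in this abstract imply a per-object cycle budget, which the following back-of-the-envelope sketch works out. It assumes, purely for illustration, that the whole frame period at the stated 100 MHz clock is available for object analysis and that objects are handled one after another; the abstract does not say how the architecture actually schedules the work.

```python
# Figures quoted in the abstract above.
clock_hz = 100_000_000        # 100 MHz clock frequency
frame_rate_hz = 25            # 25 Hz video stream
objects_per_frame = 100       # up to 100 objects per frame
object_pixels = 256 * 256     # object size in pixels

# Assumption: every clock cycle of a frame period can be spent on objects.
cycles_per_frame = clock_hz // frame_rate_hz
cycles_per_object = cycles_per_frame // objects_per_frame

# Pixels that must be consumed per cycle to stay within that budget.
pixels_per_cycle = object_pixels / cycles_per_object

print(cycles_per_frame)    # 4000000
print(cycles_per_object)   # 40000
print(pixels_per_cycle)    # 1.6384
```

Under these assumptions more than one pixel has to be processed per clock cycle on average, which is consistent with the abstract's emphasis on dedicated parallel processing elements.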
