scholarly journals Design and implementation of fast and hardware‐efficient parallel processing elements to set full and partial permutations in Beneš networks

Author(s):  
Labson Koloko ◽  
Takahiro Matsumoto ◽  
Hitoshi Obara
2009 ◽  
Vol 17 (1-2) ◽  
pp. 43-57 ◽  
Author(s):  
Michael Kistler ◽  
John Gunnels ◽  
Daniel Brokenshire ◽  
Brad Benton

In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i1processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™2 architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs). The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2858-2863

The main goal of this article is to implement an effective Non-Blocking Benes switching Network. Benes Switching Network is designed with the uncomplicated switch modules & it’s have so many advantages, small latency, less traffic and it’s required number of switch modules. Clos and Benes networks are play a key role in the class of multistage interconnection network because of their extensibility and mortality. Benes network provides a low latency when compare with the other networks. 8x8 Benes non blocking switching network is designed and synthesized with the using of Xilinx tool 12.1.


2010 ◽  
Vol 8 ◽  
pp. 135-142 ◽  
Author(s):  
H. Flatt ◽  
A. Tarnowsky ◽  
H. Blume ◽  
P. Pirsch

Abstract. Dieser Beitrag behandelt die Abbildung eines videobasierten Verfahrens zur echtzeitfähigen Auswertung von Winkelhistogrammen auf eine modulare Coprozessor-Architektur. Die Architektur besteht aus mehreren dedizierten Recheneinheiten zur parallelen Verarbeitung rechenintensiver Bildverarbeitungsverfahren und ist mit einem RISC-Prozessor verbunden. Eine konfigurierbare Architekturerweiterung um eine Recheneinheit zur Auswertung von Winkelhistogrammen von Objekten ermöglicht in Verbindung mit dem RISC eine echtzeitfähige Klassifikation. Je nach Konfiguration sind für die Architekturerweiterung auf einem Xilinx Virtex-5-FPGA zwischen 3300 und 12 000 Lookup-Tables erforderlich. Bei einer Taktfrequenz von 100 MHz können unabhängig von der Bildauflösung pro Einzelbild in einem 25-Hz-Videodatenstrom bis zu 100 Objekte der Größe 256×256 Pixel analysiert werden. This paper presents the mapping of a video-based approach for real-time evaluation of angular histograms on a modular coprocessor architecture. The architecture comprises several dedicated processing elements for parallel processing of computation-intensive image processing tasks and is coupled with a RISC processor. A configurable architecture extension, especially a processing element for evaluating angular histograms of objects in conjunction with a RISC processor, provides a real-time classification. Depending on the configuration of the architecture extension, 3 300 to 12 000 look-up tables are required for a Xilinx Virtex-5 FPGA implementation. Running at a clock frequency of 100 MHz and independently of the image resolution per frame, 100 objects of size 256×256 pixels are analyzed in a 25 Hz video stream by the architecture.


Sign in / Sign up

Export Citation Format

Share Document