Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime

2014 ◽  
Vol 24 (04) ◽  
pp. 1442004
Author(s):  
Ichitaro Yamazaki ◽  
Jakub Kurzak ◽  
Piotr Luszczek ◽  
Jack Dongarra

A systolic array provides an alternative computing paradigm to the von Neumann architecture. Though the systolic array failed in the past as a paradigm for designing integrated circuits in hardware, we are now discovering that, as a software virtualization layer, it can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel using a tree reduction. Freed from the constraint of a planar layout, we present a three-dimensional virtual systolic array architecture for this algorithm. Using a runtime developed as part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution because such a QR decomposition is used, for example, to compute the least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
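As a rough illustration of the tree-reduction idea behind such a panel factorization, the following NumPy sketch factors a tall-and-skinny matrix by a binary tree of small QR factorizations. The block count and matrix sizes are arbitrary assumptions; this is not the PULSAR implementation, only the numerical pattern that gets distributed across the virtual systolic array.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """Factor a tall-and-skinny matrix A (m x n, m >> n) by a binary
    tree reduction over row blocks and return the n x n factor R."""
    # Leaf level: QR of each row block, keeping only the R factors.
    blocks = np.array_split(A, num_blocks, axis=0)
    rs = [np.linalg.qr(b, mode='r') for b in blocks]
    # Reduction levels: stack pairs of R factors and re-factorize.
    while len(rs) > 1:
        next_level = []
        for i in range(0, len(rs) - 1, 2):
            stacked = np.vstack((rs[i], rs[i + 1]))
            next_level.append(np.linalg.qr(stacked, mode='r'))
        if len(rs) % 2:                      # odd factor carried up unchanged
            next_level.append(rs[-1])
        rs = next_level
    return rs[0]

A = np.random.randn(4000, 8)                 # tall-and-skinny test matrix
R_tree = tsqr(A)
R_ref = np.linalg.qr(A, mode='r')
# R is unique only up to row signs, so compare R^T R = A^T A instead.
assert np.allclose(R_tree.T @ R_tree, R_ref.T @ R_ref)
```

The small factorizations within each level of the tree are mutually independent, which is the property that allows the panel factorization to be spread over concurrently firing nodes of the virtual systolic array.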

Author(s):  
Ziling Wang ◽  
Li Luo ◽  
Jie Li ◽  
Lidan Wang ◽  
Shukai Duan

Abstract In-memory computing is widely expected to break the von Neumann bottleneck and the memory wall. The memristor, with its inherent nonvolatility, is considered a strong candidate for executing this new computing paradigm. In this work, we present a reconfigurable nonvolatile logic method based on a one-transistor-two-memristor (1T2M) device structure, which inhibits the sneak path in large-scale crossbar arrays. By merely adjusting the applied voltage signals, all 16 binary Boolean logic functions can be achieved in a single cell. More complex computing tasks, including a one-bit parallel full adder and a Set-Reset latch, have also been realized with optimization, showing a simple operation process, high flexibility, and low computational complexity. Circuit verification based on Cadence PSpice simulation is also provided, proving the feasibility of the proposed design. The work in this paper is intended to make progress toward architectures for the in-memory computing paradigm.
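To make the reconfigurability concrete at a purely behavioral level, the sketch below models a logic cell by a 4-bit truth-table code and composes a one-bit full adder from such cells. Encoding the "applied voltage signals" as a function code is an assumption for illustration only; this is not the 1T2M circuit scheme or a SPICE-level model.

```python
def boolean_cell(func_code, p, q):
    """Return one of the 16 two-input Boolean functions of (p, q).
    func_code is a 4-bit integer whose bits give the truth table for
    inputs (p, q) = (0,0), (0,1), (1,0), (1,1)."""
    return (func_code >> (p * 2 + q)) & 1

XOR, AND, OR = 0b0110, 0b1000, 0b1110        # truth-table codes

def full_adder(a, b, cin):
    """One-bit full adder composed from reconfigurable logic cells."""
    s1 = boolean_cell(XOR, a, b)
    sum_bit = boolean_cell(XOR, s1, cin)
    carry = boolean_cell(OR, boolean_cell(AND, a, b),
                         boolean_cell(AND, s1, cin))
    return sum_bit, carry

# Exhaustive check against integer addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, c = full_adder(a, b, cin)
            assert a + b + cin == 2 * c + s
```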


2007 ◽  
Vol 558-559 ◽  
pp. 1177-1181 ◽  
Author(s):  
Philippe Schaffnit ◽  
Markus Apel ◽  
Ingo Steinbach

The kinetics and topology of ideal grain growth were simulated using the phase-field model. Large-scale phase-field simulations were carried out in which ten thousand grains evolved into a few hundred without allowing coalescence of grains. The implementation was first validated in two dimensions by checking conformance with the square-root evolution of the average grain size and with the von Neumann-Mullins law. Three-dimensional simulations were then performed, which also showed fair agreement with the law describing the evolution of the mean grain size over time and with the results of S. Hilgenfeldt et al., 'An Accurate von Neumann's Law for Three-Dimensional Foams', Phys. Rev. Lett. 86(12), 2685, March 2001. Finally, the steady-state grain size distribution was investigated and compared to the Hillert theory.
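For reference, the 2D von Neumann-Mullins law used in the validation states that a grain with n sides changes area at a rate proportional to (n - 6). The toy sketch below applies that update to a population of grain areas with an assumed rank-based side count (not taken from the phase-field model), simply to illustrate how small grains vanish and the mean grain size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dt = 1.0, 0.01                               # reduced units
areas = rng.uniform(0.5, 2.0, size=10_000)      # initial grain areas

mean_size = []
for step in range(2000):
    # Toy topology (assumption): rank grains by area and map the rank to
    # 3..9 sides, so larger-than-average grains gain area and small ones shrink.
    ranks = areas.argsort().argsort()
    sides = 3 + (ranks * 7) // len(areas)
    areas = areas + k * (sides - 6) * dt        # von Neumann-Mullins update
    areas = areas[areas > 0]                    # fully shrunk grains disappear
    mean_size.append(np.sqrt(areas.mean()))     # mean grain "size" ~ sqrt(area)

# The mean grain size grows as small grains vanish; the full phase-field
# study checks the square-root law <size> ~ sqrt(t) quantitatively.
print(f"grains remaining: {len(areas)}, "
      f"mean size: {mean_size[0]:.3f} -> {mean_size[-1]:.3f}")
```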


1993 ◽  
Vol 2 (3) ◽  
pp. 23-35
Author(s):  
Allan R. Larrabee

The first digital computers consisted of a single processor acting on a single stream of data. In this so-called "von Neumann" architecture, computation speed is limited mainly by the time required to transfer data between the processor and memory. This limiting factor has been referred to as the "von Neumann bottleneck". The concern that the miniaturization of silicon-based integrated circuits will soon reach theoretical limits of size and gate times has led to increased interest in parallel architectures and has also spurred research into alternatives to silicon-based implementations of processors. Meanwhile, sequential processors continue to be produced with higher clock rates, more memory locally available to a processor, and faster data transfer to and from memories, networks, and remote storage. The efficiency of compilers and operating systems also continues to improve. Although such hardware characteristics limit maximum performance, a large improvement in the speed of scientific computations can often be achieved by using more efficient algorithms, particularly those that support parallel computation. This work discusses experiences with two tools for large-grain (or "macro task") parallelism.
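As a minimal illustration of large-grain ("macro task") parallelism, independent of the two tools discussed in the article, the following sketch runs a handful of coarse, independent tasks on a process pool; the task body and chunk sizes are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def macro_task(chunk):
    """A coarse-grained unit of work: a long, independent computation
    on its own slice of the problem."""
    lo, hi = chunk
    return sum(math.sin(i) * math.cos(i) for i in range(lo, hi))

if __name__ == "__main__":
    # Split one large computation into a few big chunks, one per task, so
    # scheduling and data-transfer overheads stay small relative to the work.
    chunks = [(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(macro_task, chunks))
    print("total =", sum(partials))
```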

