PERFORMANCE EVALUATION OF PRACTICAL PARALLEL COMPUTER MODEL LogPQ

2001 ◽  
Vol 12 (03) ◽  
pp. 325-340
Author(s):  
TAKAYOSHI TOUYAMA ◽  
SUSUMU HORIGUCHI

Today's supercomputers will be replaced by massively parallel computers consisting of large numbers of processing elements, in order to satisfy the continuously increasing demand for computing power. A practical parallel computation model is needed to develop efficient parallel algorithms for such machines. We have therefore proposed the practical parallel computation model LogPQ, which extends the LogP model by taking communication queues into account. This paper addresses the performance of a parallel matrix multiplication algorithm under both the LogPQ and LogP models. The algorithm is implemented on a Cray T3E, and its parallel performance is compared with that on the older CM-5. The comparison shows that the communication network of the T3E has better buffering behavior than that of the CM-5, so that no extra buffering needs to be provided on the T3E, although a small effect remains for both the send and receive buffering. On the other hand, the effect of message size persists, which shows that the overhead and gap parameters must be taken proportional to the message size.
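As a rough illustration (not the authors' code, and with purely hypothetical parameter values), the plain LogP model charges a pipelined message sequence as follows; LogPQ refines this by additionally modeling send and receive queues, and by letting the overhead and gap grow with message size:

```python
# Hypothetical LogP parameters (machine-dependent; illustrative only).
L_LAT = 6.0   # L: network latency for one message
O_OVH = 2.0   # o: processor overhead to send or receive one message
G_GAP = 4.0   # g: minimum gap between consecutive message injections

def logp_send_time(num_msgs: int) -> float:
    """Time for one PE to send num_msgs pipelined one-packet messages
    to another PE under plain LogP: the sender can inject at most one
    message per interval max(o, g); the last message then needs L to
    cross the network and o to be received."""
    if num_msgs == 0:
        return 0.0
    injection = O_OVH + (num_msgs - 1) * max(O_OVH, G_GAP)
    return injection + L_LAT + O_OVH

print(logp_send_time(1))  # o + L + o = 10.0
print(logp_send_time(4))  # o + 3g + L + o = 22.0
```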

2004 ◽  
Vol 12 (3) ◽  
pp. 169-183 ◽  
Author(s):  
Alexandros V. Gerbessiotis ◽  
Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes runs transparently under MPI-2-enabled libraries as well as bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs, consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and examining their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code, and in the case of radix-sort substantially more efficient. In addition, our work can be used as a comparative study of the relevant capabilities of the three libraries.
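For illustration, a minimal sequential LSD radix sort (a sketch, not the paper's implementation) shows the kernel that the parallel variants distribute across processors:

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort on non-negative integers: each pass is a stable
    counting sort on one digit of bits_per_pass bits. This is the
    sequential kernel underlying parallel radix-sort variants."""
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, key_bits, bits_per_pass):
        buckets = [[] for _ in range(mask + 1)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# → [2, 24, 45, 66, 75, 90, 170, 802]
```

A parallel version would counting-sort local blocks per pass and then exchange bucket counts and keys, which is exactly where one-sided RMA operations (e.g. MPI-2 put/get) can replace explicit message passing.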


1992 ◽  
Vol 278 ◽  
Author(s):  
K. M. Nelson ◽  
S. T. Smith ◽  
L. T. Wille

Abstract: We report the results of computer simulations of phase transitions in noble-gas clusters. The calculations were performed on a MasPar MP-1 massively parallel computer with 8,192 processing elements (PEs). We discuss the efficient implementation of molecular dynamics algorithms for small clusters on this type of architecture. The simulations are based on a classical Lennard-Jones pair potential and follow the temporal evolution of the system by numerically integrating Newton's equations of motion with the Gear algorithm. Because the number of particles is much smaller than the number of PEs, optimal partitioning of the processing-element array is an essential and non-trivial task.
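As a hedged sketch (not the authors' MasPar code), the Lennard-Jones force kernel at the heart of such a simulation can be written as follows; the paper integrates the resulting equations of motion with the Gear predictor-corrector, which is omitted here:

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces F_i = -dV/dr_i for a small cluster,
    where V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    n = pos.shape[0]
    f = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            d = pos[i] - pos[j]
            r2 = d @ d
            inv6 = (sigma * sigma / r2) ** 3     # (sigma/r)**6
            # force magnitude divided by r: 24*eps*(2*(s/r)^12 - (s/r)^6)/r^2
            fij = 24.0 * eps * (2.0 * inv6 * inv6 - inv6) / r2 * d
            f[i] += fij
            f[j] -= fij
    return f

# Two atoms at the potential minimum r = 2**(1/6)*sigma feel ~zero force.
pos = np.array([[0.0, 0.0, 0.0], [2.0 ** (1 / 6), 0.0, 0.0]])
print(lj_forces(pos))
```

On a SIMD array such as the MP-1, the double loop would instead be spread over the PE grid, which is why partitioning the array among far fewer particles than PEs matters.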


1997 ◽  
Vol 08 (02) ◽  
pp. 143-162 ◽  
Author(s):  
Pascal Berthomé ◽  
Afonso Ferreira

In classical massively parallel computers, the complexity of the interconnection networks is much higher than the complexity of the processing elements themselves. However, emerging optical technologies may provide a way to reconsider very large parallel architectures in which processors communicate by optical means. In this paper, we compare optically interconnected parallel multicomputer models with regard to their communication capabilities. We first establish a classification of such systems, based on the independence of the communication elements embedded in the processors (transmitters and receivers). Then, motivated by the fact that some communication operations in multicomputers must be performed very efficiently, we study two communication problems, namely broadcast and multi-broadcast, under the hypothesis of bounded fanout. Our results also take into account a bounded number of available wavelengths.
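A small back-of-the-envelope sketch (an illustration of the constraint, not the paper's model) of why bounded fanout limits broadcast speed: if each informed processor can forward the datum to at most k others per round, the informed set grows by at most a factor of k + 1 per round:

```python
def broadcast_rounds(n_procs: int, fanout: int) -> int:
    """Minimum number of rounds to broadcast one datum to n_procs
    processors when each informed processor can send to at most
    `fanout` others per round (store-and-forward, no wavelength
    constraints): informed count multiplies by fanout + 1."""
    informed, rounds = 1, 0
    while informed < n_procs:
        informed *= fanout + 1
        rounds += 1
    return rounds

print(broadcast_rounds(1024, 1))  # doubling each round: 10
print(broadcast_rounds(1024, 3))  # quadrupling each round: 5
```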


1995 ◽  
Vol 05 (01) ◽  
pp. 37-48 ◽  
Author(s):  
ARNOLD L. ROSENBERG ◽  
VITTORIO SCARANO ◽  
RAMESH K. SITARAMAN

We propose a design for, and investigate the computational power of, a dynamically reconfigurable parallel computer that we call the Reconfigurable Ring of Processors (RRP, for short). The RRP is a ring of identical processing elements (PEs) interconnected via a flexible multi-line reconfigurable bus, each of whose lines has one-packet width and can be configured, independently of the other lines, to establish an arbitrary PE-to-PE connection. A novel aspect of our design is a communication protocol we call COMET (Cooperative MEssage Transmission), which allows the PEs of an RRP to exchange one-packet messages with latency that is logarithmic in the number of PEs the message passes over in transit. The main contribution of this paper is an algorithm that allows an N-PE, N-line RRP to simulate an N-PE hypercube executing a normal algorithm, with slowdown less than 4 log log N, provided that the local state of a hypercube PE can be encoded and transmitted in a single packet. This simulation yields a rich class of efficient algorithms for the RRP, including algorithms for matrix multiplication, sorting, and the Fast Fourier Transform (often using fewer than N bus lines). The resulting algorithms for the RRP are often within a small constant factor of optimal.
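A toy calculator (illustrative only; the constant factors hidden in COMET's logarithmic latency are omitted) for the two quantitative claims above:

```python
import math

def comet_latency(hops: int) -> float:
    """COMET delivers a one-packet message over `hops` PEs with
    latency logarithmic in hops (constant factors omitted here)."""
    return math.log2(max(hops, 2))

def simulation_slowdown_bound(n_pes: int) -> float:
    """The paper's bound: an N-PE, N-line ring simulates an N-PE
    hypercube running a normal algorithm with slowdown < 4 log log N."""
    return 4.0 * math.log2(math.log2(n_pes))

print(simulation_slowdown_bound(1 << 16))  # 4 * log2(log2(65536)) = 16.0
```

The doubly logarithmic slowdown grows extremely slowly, which is why the simulation transfers a whole library of hypercube algorithms to the ring at nearly constant cost.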


Author(s):  
Robert E. Fulton ◽  
Philip S. Su

Abstract: New massively parallel computer architectures have revolutionized the design of computer algorithms and promise to have a significant influence on algorithms for engineering computations. The traditional global-model parallel method offers limited benefit on massively parallel computers. An alternative is the substructure approach. This paper explores the potential of the substructure strategy through actual examples. Each substructure is mapped onto some processors of a MIMD parallel computer. Within each substructure, the internal node variables are condensed onto the boundary node variables. All substructure computations can proceed in parallel until the global boundary system of equations is formed. A direct solution strategy then yields the global boundary displacements, after which the internal node displacements of each substructure can again be computed in parallel. Examples of two-dimensional static analysis are presented on a BBN Butterfly GP1000 parallel computer.
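A minimal NumPy sketch of the condensation step (illustrative, not the paper's implementation): eliminating internal degrees of freedom yields the boundary Schur complement, and internal displacements are recovered afterwards by back-substitution:

```python
import numpy as np

def condense(K, f, interior, boundary):
    """Condense the internal DOFs of one substructure onto its boundary:
    K_bb' = K_bb - K_bi inv(K_ii) K_ib   (the Schur complement)
    f_b'  = f_b  - K_bi inv(K_ii) f_i"""
    Kii = K[np.ix_(interior, interior)]
    Kib = K[np.ix_(interior, boundary)]
    Kbi = K[np.ix_(boundary, interior)]
    Kbb = K[np.ix_(boundary, boundary)]
    return (Kbb - Kbi @ np.linalg.solve(Kii, Kib),
            f[boundary] - Kbi @ np.linalg.solve(Kii, f[interior]))

def recover_interior(K, f, interior, boundary, u_b):
    """Back-substitute boundary displacements to get internal ones."""
    Kii = K[np.ix_(interior, interior)]
    Kib = K[np.ix_(interior, boundary)]
    return np.linalg.solve(Kii, f[interior] - Kib @ u_b)

# Verify on a small spring-chain stiffness matrix: the condensed
# solve must reproduce the direct solve of the full system.
K = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
f = np.array([1., 0., 0., 1.])
interior, boundary = [1, 2], [0, 3]
Kc, fc = condense(K, f, interior, boundary)
u_b = np.linalg.solve(Kc, fc)
u_i = recover_interior(K, f, interior, boundary, u_b)
u_full = np.linalg.solve(K, f)
print(np.allclose(u_b, u_full[boundary]), np.allclose(u_i, u_full[interior]))
```

Because each substructure's `condense` touches only its own stiffness block, all condensations can run on different processors at once; only the small boundary system is solved globally.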


Author(s):  
Jose-Maria Carazo ◽  
I. Benavides ◽  
S. Marco ◽  
J.L. Carrascosa ◽  
E.L. Zapata

Obtaining the three-dimensional (3D) structure of negatively stained biological specimens at a resolution of, typically, 2-4 nm is becoming relatively common practice in an increasing number of laboratories. A combination of new conceptual approaches, new software tools, and faster computers has made this situation possible. However, all these 3D reconstruction processes are quite computer-intensive, and the medium-term future holds developments entailing an even greater need for computing power. Up to now, all published 3D reconstructions in this field have been performed on conventional (sequential) computers, but new parallel computer architectures offer the potential of order-of-magnitude increases in computing power and should therefore be considered for the most computing-intensive tasks.

We have studied both shared-memory computer architectures, like the BBN Butterfly, and local-memory architectures, mainly hypercubes implemented on transputers, where we have used the algorithmic mapping method proposed by Zapata et al. In this work we have developed the basic software tools needed to obtain a 3D reconstruction from non-crystalline specimens ("single particles") using the so-called Random Conical Tilt Series Method. We start from a pair of images presenting the same field, first tilted (by ≃55°) and then untilted. It is then assumed that we can supply the system with the image of the particle we are looking for (ideally, a 2D average from a previous study) and with a matrix describing the geometrical relationships between the tilted and untilted fields (this step is currently accomplished by interactively marking a few pairs of corresponding features in the two fields). From here on, the 3D reconstruction process may be run automatically.

