THE RECONFIGURABLE RING OF PROCESSORS: EFFICIENT ALGORITHMS VIA HYPERCUBE SIMULATION

1995 ◽  
Vol 05 (01) ◽  
pp. 37-48 ◽  
Author(s):  
ARNOLD L. ROSENBERG ◽  
VITTORIO SCARANO ◽  
RAMESH K. SITARAMAN

We propose a design for, and investigate the computational power of, a dynamically reconfigurable parallel computer that we call the Reconfigurable Ring of Processors (RRP, for short). The RRP is a ring of identical processing elements (PEs) that are interconnected via a flexible multi-line reconfigurable bus, each of whose lines is one packet wide and can be configured, independently of the other lines, to establish an arbitrary PE-to-PE connection. A novel aspect of our design is a communication protocol we call COMET (Cooperative MEssage Transmission), which allows PEs of an RRP to exchange one-packet messages with latency that is logarithmic in the number of PEs the message passes over in transit. The main contribution of this paper is an algorithm that allows an N-PE, N-line RRP to simulate an N-PE hypercube executing a normal algorithm, with slowdown less than 4 log log N, provided that the local state of a hypercube PE can be encoded and transmitted in a single packet. This simulation provides a rich class of efficient algorithms for the RRP, including algorithms for matrix multiplication, sorting, and the Fast Fourier Transform (often using fewer than N bus lines). The resulting algorithms for the RRP are often within a small constant factor of optimal.
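
As a rough illustration of the stated bounds, the sketch below evaluates the 4 log log N slowdown bound and a latency that grows logarithmically in the hop distance, as attributed to COMET. The latency constant and the hop counts are assumptions made here for illustration; none of this is taken from the paper's proofs.

    import math

    # Hypothetical latency model: c * log2(d) packet steps for a message that
    # passes over d PEs in transit (c is an assumed constant, not from the paper).
    def comet_latency(d, c=1.0):
        return c * math.log2(d)

    # The abstract's slowdown bound for simulating an N-PE hypercube.
    def slowdown_bound(n_pes):
        return 4 * math.log2(math.log2(n_pes))

    for n in (256, 1024, 2 ** 20):
        print(f"N={n:>8}: slowdown bound < {slowdown_bound(n):.1f}, "
              f"latency across N/2 PEs ~ {comet_latency(n // 2):.1f} steps")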

2001 ◽  
Vol 12 (03) ◽  
pp. 325-340
Author(s):  
TAKAYOSHI TOUYAMA ◽  
SUSUMU HORIGUCHI

Today's supercomputers are being superseded by massively parallel computers consisting of large numbers of processing elements, in order to satisfy the continuously increasing demand for computing power. A practical parallel computation model is needed to develop efficient parallel algorithms for such machines. We have therefore presented the practical parallel computation model LogPQ, which extends the LogP model by taking communication queues into account. This paper addresses the performance of a parallel matrix multiplication algorithm under the LogPQ and LogP models. The parallel algorithm is implemented on a Cray T3E and its performance is compared with that on the older CM-5. The results show that the communication network of the T3E has better buffering behavior than that of the CM-5, so that no extra buffering needs to be provided on the T3E, although a small effect of the send and receive buffering remains. On the other hand, the effect of message size persists, which shows the need for overhead and gap terms proportional to the message size.
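
For readers unfamiliar with LogP-style accounting, the sketch below charges a fixed latency L, per-message overheads o, and an injection gap g, and models the observation that overhead and gap grow with message size through optional per-byte terms. All parameter values are assumptions for illustration, not the authors' LogPQ measurements.

    # Minimal LogP-style cost sketch (assumed parameter values).
    def message_cost(n_msgs, msg_bytes, L=10.0, o=2.0, g=4.0,
                     o_per_byte=0.0, g_per_byte=0.0):
        o_eff = o + o_per_byte * msg_bytes    # send/receive overhead per message
        g_eff = g + g_per_byte * msg_bytes    # minimum spacing between injections
        # (n_msgs - 1) injections spaced by max(g, o), then the last message
        # costs send overhead + wire latency + receive overhead.
        return (n_msgs - 1) * max(g_eff, o_eff) + o_eff + L + o_eff

    print(message_cost(n_msgs=100, msg_bytes=1024,
                       o_per_byte=0.01, g_per_byte=0.02))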


2009 ◽  
Vol 19 (03) ◽  
pp. 477-484
Author(s):  
LAURENCE BOXER

We present an algorithm for a permutation exchange operation on a coarse grained parallel computer. Our algorithm is more efficient than a previously published solution to this problem, and enables us to derive an efficient algorithm for matrix multiplication.


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 347
Author(s):  
Anne Berry ◽  
Geneviève Simonet

The atom graph of a graph is a graph whose vertices are the atoms obtained by clique minimal separator decomposition of this graph, and whose edges are the edges of all possible atom trees of this graph. We provide two efficient algorithms for computing this atom graph, with a complexity in O(min(n^ω log n, nm, n(n + m̄))) time, where n is the number of vertices of G, m is the number of its edges, m̄ is the number of edges of the complement of G, and ω, also denoted by α in the literature, is a real number such that O(n^ω) is the best known time complexity for matrix multiplication, whose current value is 2.3728596. This time complexity is no more than the time complexity of computing the atoms in the general case. We extend our results to α-acyclic hypergraphs, which are hypergraphs having at least one join tree, a join tree of a hypergraph being defined by its hyperedges in the same way as an atom tree of a graph is defined by its atoms. We introduce the notion of the union join graph, which is the union of all possible join trees; we apply our algorithms for atom graphs to efficiently compute union join graphs.
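
The short sketch below simply evaluates the three terms of the stated bound for hypothetical graph sizes, to show which term dominates in sparse and dense regimes; constants hidden by the O-notation are ignored.

    import math

    OMEGA = 2.3728596   # the matrix multiplication exponent cited in the abstract

    def best_bound(n, m):
        m_bar = n * (n - 1) // 2 - m          # edges of the complement graph
        terms = {
            "n^w log n": n ** OMEGA * math.log2(n),
            "n*m": n * m,
            "n*(n + m_bar)": n * (n + m_bar),
        }
        best = min(terms, key=terms.get)
        return best, terms[best]

    for n, m in [(1000, 5000), (1000, 400000)]:   # a sparse and a dense example
        print(n, m, best_bound(n, m))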


2003 ◽  
Vol 11 (04) ◽  
pp. 521-534 ◽  
Author(s):  
A. TOLSTOY ◽  
W. AU

The Matched Field Processing (MFP) approach discussed here is intended to extract subtle differences between apparently similar signals. The technique is applied coherently to an array of data, i.e., to two receivers. One of the main advantages of this work is that, even though we use MFP, there is no modeling involved. Since the available binaural data are quite limited and show very strong, obviously different returns from all the targets (not the subtle differences realistically expected), we found it necessary to manipulate the data to bring them more into line with expectations. In particular, scattered returns from a drum were reduced, i.e., multiplied by a small constant factor, then added to the scattered returns from bottom-only data using various time shifts. The shifts simulated a family of returns from a low signal-to-noise (S/N) 55 gallon drum target. This family with shifted bottom scattering mimics returns from multiple placements of the target on the bottom. These new target "data" (consisting of manipulated real data) seem at first glance to be nearly identical to the original bottom-only returns; thus, the new target data display only subtle differences from the bottom-only data. The MFP approach (based on the linear, a.k.a. Bartlett, processor) was then applied to these new "data", yielding a target "template" of scattered returns, varying as a function of time and frequency, that characterizes the returns scattered from the drum. Additionally, a similar template was computed for the buried manta-like target data and is seen to be quite different from the drum template. This new type of template can easily be used to detect scattering from particular target types in low S/N situations. It is not proposed that dolphins are using these templates, but rather that the templates display scattering characteristics which the dolphins may be using. More data would be extremely useful in determining the templates under a variety of conditions, e.g., lower S/N levels, different bottom types, target types, source ranges, depths, and scattering angles.
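
As a rough sketch of the kind of template construction described (assuming the conventional normalized Bartlett correlation; the paper's actual data manipulation, time shifts, and template definition are not reproduced here), one can tabulate match values over time and frequency as follows.

    import numpy as np

    def bartlett_power(replica, data):
        # Normalized Bartlett (linear) processor output for one bin:
        # |w^H d|^2 / (|w|^2 |d|^2), equal to 1 for a perfect match.
        w = replica / np.linalg.norm(replica)
        d = data / np.linalg.norm(data)
        return np.abs(np.vdot(w, d)) ** 2

    def template(replicas, data):
        # replicas, data: complex arrays of shape (n_times, n_freqs, n_receivers);
        # the result is a time x frequency map of Bartlett power.
        n_t, n_f, _ = data.shape
        out = np.zeros((n_t, n_f))
        for t in range(n_t):
            for f in range(n_f):
                out[t, f] = bartlett_power(replicas[t, f], data[t, f])
        return out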


Author(s):  
Brendan Juba ◽  
Hai S. Le

Practitioners of data mining and machine learning have long observed that the imbalance of classes in a data set negatively impacts the quality of classifiers trained on that data. Numerous techniques for coping with such imbalances have been proposed, but nearly all lack any theoretical grounding. By contrast, the standard theoretical analysis of machine learning admits no dependence on the imbalance of classes at all. The basic theorems of statistical learning establish the number of examples needed to estimate the accuracy of a classifier as a function of its complexity (VC-dimension) and the confidence desired; the class imbalance does not enter these formulas anywhere. In this work, we consider measures of classifier performance in terms of precision and recall, measures that are widely suggested as more appropriate to the classification of imbalanced data. We observe that whenever the precision is moderately large, the worse of the precision and recall is within a small constant factor of the accuracy weighted by the class imbalance. A corollary of this observation is that a larger number of examples is necessary and sufficient to address class imbalance, a finding we also illustrate empirically.
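
A small worked example (hypothetical numbers, not from the paper) shows why raw accuracy is uninformative on imbalanced data; it simply computes the quantities the abstract relates.

    def rates(tp, fp, fn, tn):
        total = tp + fp + fn + tn
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / total
        positive_fraction = (tp + fn) / total   # the class imbalance
        return precision, recall, accuracy, positive_fraction

    # A heavily imbalanced case: only 1% of the examples are positive.
    p, r, a, pi = rates(tp=60, fp=40, fn=40, tn=9860)
    print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.3f} positives={pi:.2%}")
    # Accuracy is about 0.99 even though precision and recall are only 0.60,
    # which is why accuracy alone is misleading when classes are imbalanced.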


2010 ◽  
Vol 49 (12) ◽  
pp. 2352 ◽  
Author(s):  
Xianchao Wang ◽  
Junjie Peng ◽  
Mei Li ◽  
Zhangyi Shen ◽  
Ouyang Shan

2004 ◽  
Vol 12 (3) ◽  
pp. 169-183 ◽  
Author(s):  
Alexandros V. Gerbessiotis ◽  
Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple and intuitive way. If one uses our approach, the parallel code one writes will run transparently not only under MPI-2 enabled libraries but also under bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs, consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and we examine their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the message-passing equivalent, and in the case of radix-sort substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.
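
A minimal sketch of the RMA programming style the authors advocate, written here with mpi4py one-sided operations; this is an illustration under assumed tooling, not the paper's framework or its BSPlib/PUB back ends.

    # Run under mpiexec with at least two processes.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each process exposes a one-element window that other processes may write into.
    local = np.zeros(1, dtype='i')
    win = MPI.Win.Create(local, comm=comm)

    win.Fence()                              # open an access epoch
    if rank == 0:
        payload = np.array([42], dtype='i')
        win.Put(payload, target_rank=1)      # remote write, no matching receive
    win.Fence()                              # close the epoch; the Put is now visible

    if rank == 1:
        print("rank 1 received", local[0])
    win.Free()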


2009 ◽  
Vol 20 (01) ◽  
pp. 45-55
Author(s):  
REGANT Y. S. HUNG ◽  
H. F. TING

The advance of wireless and mobile technology introduces a new type of Video-on-Demand (VOD) system, namely mobile VOD systems, which provide VOD services to mobile clients. It is a challenge to design broadcasting protocols for such systems because of the following special requirements: (1) fixed maximum bandwidth: the maximum bandwidth required for broadcasting a movie should be fixed and independent of the number of requests; (2) load adaptivity: the total bandwidth usage should depend on the number of requests, so that the fewer the requests, the smaller the total bandwidth usage; and (3) client sensitivity: the system should be able to support clients with a wide range of heterogeneous capabilities. In the literature, there are partial solutions that give protocols meeting one or two of the above requirements. In this paper, we give the first protocol that meets all three requirements. The performance of our protocol is optimal up to a small constant factor.


2000 ◽  
Vol 10 (04) ◽  
pp. 343-357 ◽  
Author(s):  
JOSEP DIAZ ◽  
JORDI PETIT ◽  
MARIA SERNA

In this paper we analyze the computational power of random geometric networks in the presence of random (edge or node) faults, considering several important network parameters. We first analyze how to emulate an original random geometric network G on a faulty network F. Our results state that, under some natural assumptions, random geometric networks can tolerate a constant node failure probability with a constant slowdown. In the case of a constant edge failure probability, the slowdown is an arbitrarily small constant times the logarithm of the graph order. Then we consider several network measures, stated as linear layout problems (Bisection, Minimum Linear Arrangement, and Minimum Cut Width). Our results show that random geometric networks can tolerate a constant edge (or node) failure probability while maintaining the order of magnitude of the measures considered here. Finally, we show that, with high probability, random geometric networks with (edge or node) faults have a Hamiltonian cycle, provided the failure probability is constant. Such a capability enables performing distributed computations based on end-to-end communication protocols.
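
A minimal sketch of this setting (not the paper's emulation scheme): generate a random geometric graph, delete each edge independently with a constant probability, and inspect how much of the network survives. The parameter values are arbitrary illustrative choices.

    import random
    import networkx as nx

    n, radius, p_fail = 500, 0.1, 0.2
    G = nx.random_geometric_graph(n, radius)       # the fault-free network
    F = G.copy()
    F.remove_edges_from([e for e in G.edges if random.random() < p_fail])
    giant = max(nx.connected_components(F), key=len)
    print(f"edges kept: {F.number_of_edges()}/{G.number_of_edges()}, "
          f"largest component: {len(giant)}/{n}")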

