High-Performance Implementation

Author(s):  
E. A. Ashcroft ◽  
A. A. Faustini ◽  
R. Jaggannathan ◽  
W. W. Wadge

In Chapter 1, we saw how Lucid could be used to express solutions to standard problems such as sorting and matrix multiplication. A distinctive characteristic of Lucid is that it can serve not only as a programming language but also as a “composition” language. That is, instead of using Lucid to specify computations, it can be used to express how computation components (written in some other language) are “glued” together to form a coherent application. The resulting application can then enjoy some of the practical benefits attributable to Lucid, such as high performance through the exploitation of implicit parallelism and robustness through software fault tolerance. In this chapter, we discuss one such use of Lucid: as part of a hybrid language for constructing parallel applications to be executed on conventional parallel computers.

A conventional parallel computer consists either of a number of processors, each with local memory, interconnected by a network (as in distributed-memory architectures), or of a number of processors that share memory, possibly through an interconnection network (as in shared-memory architectures). The past decade has seen the advent of conventional parallel computers, starting with the Denelcor HEP, evolving to the CM-2 and Intel Hypercube, and further evolving to the CM-5, Intel Paragon, Cray T3D, and IBM SP-2. Even networks of workstations (workstation clusters) are seen as low-cost (“poor man’s”) parallel computers.

Programming conventional parallel computers has proven to be far more challenging than had been expected. Part of the reason is the continued use of low-level, explicitly parallel programming models such as PVM [42] and Linda [10]. Two factors have fueled the continuing use of such languages despite their limited success:

1. The need to reuse existing sequential code, because the cost of rewriting legacy applications from scratch is considered prohibitive in both economic and technical terms.
2. The need to run on conventional parallel computers that view a “parallel program” at a low level, as consisting of sequential processes that frequently synchronize and communicate with each other using some form of message passing (a minimal sketch of this model follows).
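As a rough illustration of the low-level, explicitly parallel model just described, the following C sketch shows two sequential processes synchronizing and communicating by message passing. MPI is used purely for concreteness (the abstract itself names PVM and Linda as examples of such models); the process structure, message size, and tags are arbitrary.

```c
/* A minimal sketch of the low-level, explicitly parallel model described
 * above: two sequential processes that synchronize and communicate by
 * message passing.  MPI is used here purely for concreteness; the abstract
 * names PVM and Linda as examples of such models.
 * Typical build/run: mpicc lowlevel.c -o lowlevel && mpirun -np 2 ./lowlevel */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* The programmer must spell out every send, receive, and tag. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 0 received %d back\n", value);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        value += 1;
        MPI_Send(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```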

VLSI Design ◽  
2001 ◽  
Vol 12 (1) ◽  
pp. 25-52 ◽  
Author(s):  
Taras I. Golota ◽  
Sotirios G. Ziavras

Existing message-passing parallel computers employ routers designed for a specific interconnection network and a fixed data channel width. This approach has disadvantages: system design and development times are significant, and such routers do not permit run-time network reconfiguration. Changes in the topology of the network may be required for better performance or fault tolerance. In this paper, we introduce a class of high-performance universal (statically and dynamically adaptable) programmable routers (UPRs) for message-passing parallel computers. The universality of these routers is based on their capability to adapt, at run time and/or statically, to the characteristics of the systems and/or applications. More specifically, the number of bidirectional data channels, the channel size, and the I/O port mappings (for the implementation of a particular topology) can change dynamically and statically. Our research focuses on system-level specification issues of the UPRs, their VLSI design, and their simulation to estimate their performance. Our simulation of data transfers via UPR routers employs VHDL code in the Mentor Graphics environment. The results show that the performance of the routers depends mostly on their current configuration. Details of the simulation and synthesis are presented.
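The adaptable parameters listed above (number of bidirectional data channels, channel width, I/O port mappings) can be pictured as a configuration record that is installed statically or at run time. The C sketch below is purely illustrative; the structure fields, limits, and the `upr_reconfigure` helper are hypothetical and are not taken from the UPR design or its VHDL implementation.

```c
/* Illustrative software model of the reconfigurable parameters the abstract
 * attributes to a UPR: number of bidirectional data channels, channel width,
 * and I/O port mapping.  All names and limits are hypothetical. */
#include <stdio.h>

#define UPR_MAX_CHANNELS 16   /* hypothetical upper bound */

struct upr_config {
    int num_channels;                  /* bidirectional data channels in use */
    int channel_width_bits;            /* width of each data channel */
    int port_map[UPR_MAX_CHANNELS];    /* output port chosen for each input port */
};

/* Install a new configuration, statically or at run time, e.g. to switch
 * the emulated topology for performance or fault-tolerance reasons. */
static int upr_reconfigure(struct upr_config *cfg, int channels, int width_bits)
{
    if (channels < 1 || channels > UPR_MAX_CHANNELS || width_bits < 1)
        return -1;                     /* reject an invalid configuration */
    cfg->num_channels = channels;
    cfg->channel_width_bits = width_bits;
    for (int i = 0; i < channels; i++)
        cfg->port_map[i] = (i + 1) % channels;   /* simple ring mapping */
    return 0;
}

int main(void)
{
    struct upr_config cfg;
    if (upr_reconfigure(&cfg, 4, 16) == 0)
        printf("UPR set to %d channels of %d bits (ring mapping)\n",
               cfg.num_channels, cfg.channel_width_bits);
    return 0;
}
```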


2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Mouna Baklouti ◽  
Mohamed Abid

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, most of them are based on a custom-logic design methodology. Designing parallel multicore systems from available standard intellectual property (IP) blocks while maintaining high performance is also a challenging issue. Softcore processors and field-programmable gate arrays (FPGAs) are a cheap and fast option for developing and testing such systems. This paper describes an FPGA-based design methodology for implementing a rapid prototype of parametric multicore systems. A study of the viability of building the SoC around the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to measure the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).
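The reported speedup figures rest on the usual definitions of speedup and parallel efficiency, S = T1/Tp and E = S/p. The short C sketch below shows that calculation with placeholder timings; the numbers are not measurements from the paper.

```c
/* Sketch of how speedup and efficiency are commonly computed when testing
 * parallel applications such as the FIR filter and matrix-matrix
 * multiplication mentioned above.  The timings are placeholders, not
 * measurements from the paper. */
#include <stdio.h>

int main(void)
{
    double t_sequential = 10.0;   /* placeholder: run time on one core (s) */
    double t_parallel   = 2.8;    /* placeholder: run time on p cores (s)  */
    int    p            = 4;      /* placeholder: number of cores          */

    double speedup    = t_sequential / t_parallel;   /* S = T1 / Tp */
    double efficiency = speedup / p;                 /* E = S / p   */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```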


Author(s):  
SOTIRIOS G. ZIAVRAS ◽  
MICHALIS A. SIDERAS

The direct binary hypercube interconnection network has been very popular for the design of parallel computers, because it provides a low diameter and can efficiently emulate the majority of the topologies frequently employed in the development of algorithms. The last fifteen years have seen major efforts to develop image analysis algorithms for hypercube-based parallel computers. These efforts have culminated in a large number of publications in prestigious scholarly journals and conference proceedings. Nevertheless, the aforementioned powerful properties of the hypercube come at the cost of high VLSI complexity, due to the increase in the number of communication ports and channels per PE (processing element) as the total number of PEs grows. The high VLSI complexity of hypercube systems is undoubtedly their dominant drawback; it results in the construction of systems that contain either a large number of primitive PEs or a small number of powerful PEs. Therefore, low-dimensional k-ary n-cubes with lower VLSI complexity have recently drawn the attention of many designers of parallel computers. Alternative solutions reduce the hypercube’s VLSI complexity without jeopardizing its performance. Such an effort by Ziavras has resulted in the introduction of reduced hypercubes (RHs). Taking advantage of existing high-performance routing techniques, such as wormhole routing, an RH is obtained by a uniform reduction in the number of edges incident to each hypercube node. An RH can also be viewed as several connected copies of the well-known cube-connected-cycles network. The objective here is to prove that parallel computers comprising RH interconnection networks are good choices for all levels of image analysis. Since the exact requirements of high-level image analysis are difficult to identify, while versatile interconnection networks such as the hypercube are believed to be suitable for the relevant tasks, we investigate the problem of emulating hypercubes on RHs. The ring (or linear array), the torus (or mesh), and the binary tree are the most frequently used topologies for the development of algorithms in low-level and intermediate-level image analysis. Thus, to prove the viability of the RH for the two lower levels of image analysis, we introduce techniques for embedding these three topologies into RHs. The results prove the suitability of RHs for all levels of image analysis.
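Embedding a ring into an ordinary binary hypercube is classically done with a reflected Gray code: consecutive codewords differ in exactly one bit, so consecutive ring nodes map to neighbouring hypercube nodes (dilation 1). The C sketch below illustrates only that standard construction; the embeddings into RHs developed in the paper are more involved.

```c
/* Classic Gray-code embedding of a ring (or linear array) into a binary
 * hypercube: node i of a 2^n-node ring maps to hypercube node i ^ (i >> 1).
 * Consecutive Gray codes differ in exactly one bit, so ring neighbours land
 * on hypercube neighbours (dilation 1).  This shows the standard full-cube
 * construction, not the RH embeddings developed in the paper. */
#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void)
{
    const unsigned n = 3;              /* hypercube dimension */
    const unsigned nodes = 1u << n;    /* 2^n processing elements */

    for (unsigned i = 0; i < nodes; i++) {
        unsigned a = gray(i);
        unsigned b = gray((i + 1) % nodes);       /* next node on the ring */
        printf("ring %u -> cube %u, neighbour differs in bit mask %u\n",
               i, a, a ^ b);                      /* a ^ b has one bit set */
    }
    return 0;
}
```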


Author(s):  
Esthela Gallardo ◽  
Jérôme Vienne ◽  
Leonardo Fialho ◽  
Patricia Teller ◽  
James Browne

MPI_T, the MPI Tool Information Interface, was introduced in the MPI 3.0 standard with the aim of enabling the development of more effective tools to support the Message Passing Interface (MPI), a standardized and portable message-passing system that is widely used in parallel programs. Most MPI optimization tools do not yet employ MPI_T and only describe the interactions between an application and an MPI library, thus requiring that users have expert knowledge to translate this information into optimizations. In contrast, MPI Advisor, a recently developed, easy-to-use methodology and tool for MPI performance optimization, pioneered the use of information provided by MPI_T to characterize the communication behaviors of an application and identify an MPI configuration that may enhance application performance. In addition to enabling the recommendation of performance optimizations, MPI_T has the potential to enable automatic runtime application of these optimizations. Optimization of MPI configurations is important because: (1) the vast majority of parallel applications executed on high-performance computing clusters use MPI for communication among processes, (2) most users execute their programs using the cluster’s default MPI configuration, and (3) while default configurations may give adequate performance, it is well known that optimizing the MPI runtime environment can significantly improve application performance, in particular when the way in which the application is executed and/or the application’s input changes. This paper provides an overview of MPI_T, describes how it can be used to develop more effective MPI optimization tools, and demonstrates its use within an extended version of MPI Advisor. In doing the latter, it presents several MPI configuration choices that can significantly impact performance, shows how information collected at runtime with MPI_T and PMPI can be used to enhance performance, and presents MPI Advisor case studies of these configuration optimizations with performance gains of up to 40%.
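As a minimal illustration of the MPI_T interface itself (not of MPI Advisor's methodology), the C sketch below uses the MPI 3.0 control-variable calls to list the configuration variables an MPI library exposes; the buffer sizes are arbitrary.

```c
/* Minimal sketch: enumerate the control variables (cvars) an MPI library
 * exposes through MPI_T, as defined in the MPI 3.0 standard.  A tuning tool
 * would inspect and set such variables; here we only list them.
 * Build with an MPI 3.0+ library: mpicc list_cvars.c -o list_cvars */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, num_cvars;

    /* MPI_T has its own initialization and may be used before MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num_cvars);
    printf("MPI library exposes %d control variables\n", num_cvars);

    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        printf("cvar %d: %s\n", i, name);
    }

    MPI_T_finalize();
    return 0;
}
```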


2003 ◽  
Vol 13 (01) ◽  
pp. 53-64 ◽  
Author(s):  
ERIC GAMESS

In this paper, we address the goal of executing parallel Java applications on a group of Beowulf cluster nodes chosen transparently by a metacomputing system oriented toward the efficient execution of Java bytecode, with support for scientific computing. To this end, we extend the Java virtual machine with a message-passing interface and quick access to distributed high-performance resources. We also introduce the execution of parallel linear algebra methods on large objects from sequential Java applications by invoking SPLAM, our parallel linear algebra package.


2006 ◽  
Vol 16 (03) ◽  
pp. 323-334 ◽  
Author(s):  
IGOR ROZMAN ◽  
MARJAN ŠTERK ◽  
ROMAN TROBEC

High performance parallel computers provide the computational rates necessary for computer simulations and compute-intensive applications. An important part of a parallel computer program is the MPI software library, which implements communication within parallel applications. Several MPI implementations exist; the most widely used among them are LAM/MPI and MPICH. This paper presents the results of four basic synthetic tests and two real simulations in the LAM/MPI and MPICH environments. The tests were run on a computer cluster composed of 17 dual-processor nodes connected by a toroidal mesh. The results show that, on the investigated cluster, LAM outperformed MPICH, especially in bidirectional ring communication, and that appropriate tuning of the communication parameters contributes significantly to the final parallel performance.
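The bidirectional ring test referred to above can be pictured with a small MPI microbenchmark of the following form. This sketch is not the benchmark code used in the paper; the message size and repetition count are arbitrary.

```c
/* Illustrative bidirectional ring communication test, the pattern on which
 * LAM reportedly outperformed MPICH.  Each rank exchanges a message with
 * both of its ring neighbours in every iteration using MPI_Sendrecv.
 * Message size and repetition count are arbitrary. */
#include <mpi.h>
#include <stdio.h>

#define MSG_INTS 1024
#define REPS     1000

int main(int argc, char *argv[])
{
    int rank, size, left, right;
    int sendbuf[MSG_INTS] = {0}, recvbuf[MSG_INTS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank - 1 + size) % size;   /* ring neighbours */
    right = (rank + 1) % size;

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        /* send right / receive from left ... */
        MPI_Sendrecv(sendbuf, MSG_INTS, MPI_INT, right, 0,
                     recvbuf, MSG_INTS, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... then send left / receive from right */
        MPI_Sendrecv(sendbuf, MSG_INTS, MPI_INT, left, 1,
                     recvbuf, MSG_INTS, MPI_INT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bidirectional ring: %.3f ms per iteration\n",
               1000.0 * (t1 - t0) / REPS);

    MPI_Finalize();
    return 0;
}
```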


Author(s):  
Michael P. Allen ◽  
Dominic J. Tildesley

Parallelization is essential for the effective use of modern high-performance computing facilities. This chapter summarizes some of the basic approaches that are commonly used in molecular simulation programs. The underlying shared-memory and distributed-memory architectures are explained. The concept of program threads and their use in parallelizing nested loops on a shared memory machine is described. Parallel tempering using message passing on a distributed memory machine is discussed and illustrated with an example code. Domain decomposition, and the implementation of constraints on parallel computers, are also explained.
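The use of program threads to parallelize a nested loop on a shared-memory machine can be sketched with OpenMP, one common way to express such threads in C. The pair-interaction loop below is a generic illustration with a toy potential, not code from the chapter.

```c
/* Sketch of thread-based parallelization of a nested loop on a shared-memory
 * machine, using OpenMP threads.  The pair loop and its 1/r^2 "potential"
 * are a toy illustration, not code from the chapter.
 * Compile with e.g.: cc -fopenmp pairs.c -lm */
#include <math.h>
#include <stdio.h>

#define N 2000   /* arbitrary number of particles */

int main(void)
{
    static double x[N], y[N], z[N];
    double energy = 0.0;

    for (int i = 0; i < N; i++) {             /* arbitrary coordinates */
        x[i] = sin(0.1 * i);
        y[i] = cos(0.2 * i);
        z[i] = 0.05 * i;
    }

    /* Threads divide the outer loop; the reduction clause makes the shared
     * accumulation of 'energy' safe without explicit locking. */
    #pragma omp parallel for reduction(+:energy) schedule(dynamic)
    for (int i = 0; i < N - 1; i++) {
        for (int j = i + 1; j < N; j++) {
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r2 = dx * dx + dy * dy + dz * dz;
            energy += 1.0 / r2;               /* toy pair "potential" */
        }
    }

    printf("total pair energy: %f\n", energy);
    return 0;
}
```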

