Structural Outlooks for the OTIS-Arrangement Network

2011 ◽  
Vol 3 (2) ◽  
pp. 59-68 ◽  
Author(s):  
Ahmad Awwad ◽  
Jehad Al-Sadi ◽  
Bassam Haddad ◽  
Ahmad Kayed

Recent studies have revealed that Optical Transpose Interconnection Systems (OTIS) are promising candidates for future high-performance parallel computers. This paper presents and evaluates a general method for algorithm development on the OTIS-Arrangement network (OTIS-AN) as an example of an OTIS network. The proposed method can be reused and customized for any other OTIS network. Furthermore, it allows efficient mapping of a wide class of algorithms onto the OTIS-AN. The method is based on grids and pipelines, two popular structures that underpin a vast body of parallel applications, including linear algebra, divide-and-conquer algorithms, sorting, and FFT computation. This study confirms the viability of the OTIS-AN as an attractive alternative for large-scale parallel architectures.
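For readers new to the OTIS construction, the sketch below (plain C, with a simple ring standing in for the arrangement-graph factor network used in the paper) illustrates the two kinds of moves available to a node <g, p>: an electronic move within its group along a factor-graph edge, and the optical transpose move to node <p, g>. The ring factor graph and all names here are illustrative assumptions, not the authors' construction.

```c
#include <stdio.h>

/* A node of an OTIS network built over an N-node factor graph is a pair
 * <group g, processor p>, with 0 <= g, p < N.  Electronic links stay inside
 * a group and follow factor-graph edges; the single optical link of <g, p>
 * leads to the "transposed" node <p, g>. */
typedef struct { int group; int proc; } otis_node;

/* Optical (transpose) move: <g, p> -> <p, g>. */
static otis_node optical_move(otis_node u) {
    otis_node v = { u.proc, u.group };
    return v;
}

/* Electronic move: stay in the same group, step to a neighbouring processor
 * of the factor network.  For illustration the factor graph is a ring; the
 * OTIS-AN of the paper uses the arrangement graph A(n,k) instead. */
static otis_node electronic_move(otis_node u, int step, int n) {
    otis_node v = { u.group, (u.proc + step + n) % n };
    return v;
}

int main(void) {
    int n = 6;                           /* illustrative factor-graph size */
    otis_node u = { 2, 5 };
    otis_node via_optics = optical_move(u);
    otis_node via_wire   = electronic_move(u, +1, n);
    printf("<%d,%d> --optical--> <%d,%d>\n", u.group, u.proc,
           via_optics.group, via_optics.proc);
    printf("<%d,%d> --electronic--> <%d,%d>\n", u.group, u.proc,
           via_wire.group, via_wire.proc);
    return 0;
}
```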


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
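As a hedged illustration of the trade-off idea only (not the authors' statistical or machine learning models), the sketch below applies a plain Pareto-dominance filter to predicted (runtime, energy) pairs for a handful of made-up candidate configurations, keeping those that no other candidate beats on both objectives.

```c
#include <stdio.h>

/* One candidate configuration with its predicted runtime and energy.
 * The names and numbers are made up purely for illustration. */
typedef struct { const char *name; double runtime_s; double energy_j; } config;

/* A configuration is Pareto-optimal if no other configuration is at least as
 * good on both objectives and strictly better on one. */
static int dominated(const config *c, const config *all, int n) {
    for (int i = 0; i < n; i++) {
        const config *o = &all[i];
        if (o == c) continue;
        if (o->runtime_s <= c->runtime_s && o->energy_j <= c->energy_j &&
            (o->runtime_s < c->runtime_s || o->energy_j < c->energy_j))
            return 1;
    }
    return 0;
}

int main(void) {
    config cand[] = {
        { "16 threads, 2.4 GHz", 100.0, 5000.0 },
        { "16 threads, 1.8 GHz", 112.0, 4200.0 },  /* slower but cheaper */
        { "8 threads, 2.4 GHz",  150.0, 4800.0 },  /* dominated          */
        { "32 threads, 1.8 GHz",  95.0, 5400.0 },
    };
    int n = (int)(sizeof cand / sizeof cand[0]);
    for (int i = 0; i < n; i++)
        if (!dominated(&cand[i], cand, n))
            printf("Pareto-optimal: %s (%.0f s, %.0f J)\n",
                   cand[i].name, cand[i].runtime_s, cand[i].energy_j);
    return 0;
}
```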


1999 ◽  
Vol 09 (02) ◽  
pp. 243-252 ◽  
Author(s):  
O. LARSSON ◽  
M. FEIG ◽  
L. JOHNSSON

We demonstrate good metacomputing efficiency and portability for three typical large-scale parallel applications: one molecular dynamics code and two electromagnetics codes. The codes were developed for distributed-memory parallel platforms using Fortran77 or Fortran90 with MPI. The performance measurements were made on a testbed of two IBM SPs connected through the vBNS. No changes to the application codes were required for correct execution on the testbed, which used the Globus Toolkit for the required metacomputing services. However, we observe that for good performance, MPI codes may need to overlap computation and communication. For such MPI codes, a communications library designed for hierarchical or clustered communication can yield very good metacomputing efficiencies when high-performance networks, such as the vBNS or Abilene, are used for platform connectivity. We demonstrate this by inserting a thin layer between the MPI application and the MPI libraries, providing some clustering of communications between platforms.
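One standard way to insert such a thin layer is the MPI profiling interface (PMPI), which lets a library intercept MPI calls without modifying application code. The sketch below only illustrates that interception mechanism; it simply counts sends before forwarding them and is not the clustering library described in the paper.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal interposition layer using the MPI profiling interface: the linker
 * resolves the application's MPI_Send to this wrapper, which could reorder or
 * aggregate messages but here merely counts them before handing them to the
 * real implementation via PMPI_Send. */
static long intercepted_sends = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    intercepted_sends++;                 /* bookkeeping done by the layer */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends passed through the layer\n",
           rank, intercepted_sends);
    return PMPI_Finalize();
}
```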


2003 ◽  
Vol 13 (01) ◽  
pp. 53-64 ◽  
Author(s):  
ERIC GAMESS

In this paper, we address the goal of transparently executing Java parallel applications on a group of nodes of a Beowulf cluster chosen by a metacomputing system oriented toward efficient execution of Java bytecode, with support for scientific computing. To this end, we extend the Java virtual machine with a message passing interface and quick access to distributed high-performance resources. We also enable the execution of parallel linear algebra methods on large objects from sequential Java applications by invoking SPLAM, our parallel linear algebra package.


2006 ◽  
Vol 16 (03) ◽  
pp. 323-334
Author(s):  
IGOR ROZMAN ◽  
MARJAN ŠTERK ◽  
ROMAN TROBEC

High performance parallel computers provide the computational rates necessary for computer simulations and compute-intensive applications. An important part of a parallel computer program is the MPI software library, which implements communication within parallel applications. Several MPI implementations exist; the most widely used among them are LAM/MPI and MPICH. This paper presents results of four basic synthetic tests and two real simulations in the LAM/MPI and MPICH environments. Tests were made on a computer cluster composed of 17 dual-processor nodes connected by a toroidal mesh. Results show that on the investigated cluster LAM outperformed MPICH, especially for bidirectional ring communication, and that appropriate tuning of communication parameters contributes significantly to the final parallel performance.
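For reference, the bidirectional ring pattern used in such tests can be written with MPI_Sendrecv so that every rank exchanges a message with both neighbours without deadlock. This is a generic sketch of the communication pattern, not the authors' benchmark code.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank sends to its right neighbour while receiving from its left
 * neighbour, then the reverse, forming one bidirectional ring exchange. */
int main(int argc, char **argv)
{
    int rank, size, msg, from_left = -1, from_right = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    msg = rank;

    /* send right / receive from left, then send left / receive from right */
    MPI_Sendrecv(&msg, 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&msg, 1, MPI_INT, left, 1,
                 &from_right, 1, MPI_INT, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from left and %d from right\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}
```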


2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities. In addition to the standard single and double precision types, the standard also includes half and quadruple precision. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
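To make the batching idea concrete, here is a minimal sketch of one uniformly sized group of small matrix multiplications handled by a single call. The routine name dgemm_batch_uniform and the plain triple-loop kernel are illustrative assumptions, not part of the standard described in the article; a real Batched BLAS routine adds transpose options, alpha/beta scaling, leading dimensions, and optimised kernels.

```c
#include <stdio.h>

/* Illustrative "batched" multiply for one uniformly sized group:
 * C[i] = A[i] * B[i] for each of the batch_count m-by-k / k-by-n problems,
 * stored contiguously in row-major order. */
static void dgemm_batch_uniform(int m, int n, int k, int batch_count,
                                const double *A, const double *B, double *C)
{
    for (int b = 0; b < batch_count; b++) {
        const double *Ab = A + (long)b * m * k;
        const double *Bb = B + (long)b * k * n;
        double       *Cb = C + (long)b * m * n;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int p = 0; p < k; p++)
                    s += Ab[i * k + p] * Bb[p * n + j];
                Cb[i * n + j] = s;
            }
    }
}

int main(void) {
    enum { M = 2, N = 2, K = 2, BATCH = 3 };
    double A[BATCH * M * K], B[BATCH * K * N], C[BATCH * M * N];
    for (int i = 0; i < BATCH * M * K; i++) A[i] = i + 1;        /* dummy data */
    for (int i = 0; i < BATCH * K * N; i++) B[i] = (i % 2) ? 1.0 : 0.5;
    dgemm_batch_uniform(M, N, K, BATCH, A, B, C);
    printf("first entry of first result matrix = %g\n", C[0]);
    return 0;
}
```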


2008 ◽  
Vol 16 (2-3) ◽  
pp. 105-121 ◽  
Author(s):  
Martin Schulz ◽  
Jim Galarowicz ◽  
Don Maghrak ◽  
William Hachfeld ◽  
David Montoya ◽  
...  

Over the last decades, a large number of performance tools have been developed to analyze and optimize high performance applications. Their acceptance by end users, however, has been slow: each tool alone is often limited in scope and comes with widely varying interfaces and workflow constraints, requiring different changes in the often complex build and execution infrastructure of the target application. We started the Open|SpeedShop project about 3 years ago to overcome these limitations and provide efficient, easy-to-apply, and integrated performance analysis for parallel systems. Open|SpeedShop has two different faces: it provides an interoperable tool set covering the most common analysis steps as well as a comprehensive plugin infrastructure for building new tools. In both cases, the tools can be deployed to large-scale parallel applications using DPCL/Dyninst for distributed binary instrumentation. Further, all tools developed within or on top of Open|SpeedShop are accessible through multiple fully equivalent interfaces, including an easy-to-use GUI as well as an interactive command line interface, reducing the usage threshold for those tools.


2012 ◽  
Vol 2012 ◽  
pp. 1-18 ◽  
Author(s):  
Xiaocheng Liu ◽  
Bin Chen ◽  
Xiaogang Qiu ◽  
Ying Cai ◽  
Kedi Huang

An increasing number of high performance computing parallel applications leverage the power of the cloud for parallel processing. How to schedule these parallel applications to improve the quality of service is the key to successfully hosting them in the cloud. The large scale of the cloud makes parallel job scheduling more complicated, as even the simple parallel job scheduling problem is NP-complete. In this paper, we propose a parallel job scheduling algorithm named MEASY. MEASY adopts migration and consolidation to enhance the most popular EASY scheduling algorithm. Our extensive experiments on well-known workloads show that the algorithm takes very good care of the quality of service. For two common parallel job scheduling objectives, it achieves up to 41.1% (23.1% on average) improvement in average response time and up to 82.9% (69.3% on average) improvement in average slowdown. The algorithm is also robust in that it tolerates inaccurate CPU usage estimation and high migration cost. Our approach involves only trivial modification of EASY and requires no additional technique; it is practical and effective in the cloud environment.
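For context, EASY is a backfilling scheduler: a waiting job may jump ahead of the queue head only if doing so cannot delay the head job's reservation. The sketch below shows that admission test in isolation, with invented field names and numbers; the migration and consolidation steps that MEASY adds are not modelled.

```c
#include <stdio.h>

/* One waiting job: processors requested and user-estimated runtime. */
typedef struct { int procs; double est_runtime_s; } job;

/* EASY backfilling admission test.  Inputs (normally derived from the set of
 * running jobs): free_now      - processors idle right now,
 *                shadow_time_s - earliest time the queue-head job can start,
 *                extra_procs   - processors still free at that start time.
 * A waiting job may be backfilled only if it fits in the idle processors and
 * cannot delay the head job's reservation. */
static int can_backfill(const job *j, double now_s, int free_now,
                        double shadow_time_s, int extra_procs)
{
    if (j->procs > free_now)
        return 0;                            /* does not fit at all         */
    if (now_s + j->est_runtime_s <= shadow_time_s)
        return 1;                            /* finishes before the shadow  */
    return j->procs <= extra_procs;          /* or never needs the head
                                                job's reserved processors   */
}

int main(void) {
    job small_short = { 4, 600.0 };          /* 4 procs, 10 min (made up)   */
    job small_long  = { 4, 7200.0 };         /* 4 procs, 2 h                */
    double now = 0.0, shadow = 1800.0;       /* head job starts in 30 min   */
    int free_now = 8, extra = 2;
    printf("short job backfills: %d\n",
           can_backfill(&small_short, now, free_now, shadow, extra));
    printf("long job backfills:  %d\n",
           can_backfill(&small_long, now, free_now, shadow, extra));
    return 0;
}
```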


Author(s):  
E. A. Ashcroft ◽  
A. A. Faustini ◽  
R. Jaggannathan ◽  
W. W. Wadge

In Chapter 1, we saw how Lucid could be used to express solutions to standard problems such as sorting and matrix multiplication. One of the unique characteristics of Lucid is that it can be used not only as a programming language but also as a “composition” language. That is, instead of using Lucid to specify computations, it can be used to express how computation components (expressed in some other language) can be “glued” together to form a coherent application. By doing so, the resulting application can enjoy some of the practical benefits attributable to Lucid, such as high performance through exploitation of implicit parallelism and robustness through software fault tolerance. In this chapter, we discuss one such use of Lucid: as part of a hybrid language for constructing parallel applications to be executed on conventional parallel computers. A conventional parallel computer consists either of a number of processors, each with local memory, interconnected by a network (as in distributed-memory architectures), or of a number of processors that share memory, possibly through an interconnection network (as in shared-memory architectures). The past decade has seen the advent of conventional parallel computers, starting with the Denelcor HEP, evolving to the CM-2 and Intel Hypercube, and further evolving to the CM-5, Intel Paragon, Cray T3D, and IBM SP-2. Even networks of workstations (or workstation clusters) are seen as low-cost (“poor man’s”) parallel computers. Programming of conventional parallel computers has proven to be far more challenging than had been expected. Part of the reason is the continued use of low-level, explicitly parallel programming models such as PVM [42] and Linda [10]. Two factors have fueled the continuing use of such languages despite their limited success:
1. The need to reuse existing sequential code, because the cost of rewriting legacy applications from scratch is considered prohibitive in both economic and technical terms.
2. The need to run on conventional parallel computers that view a “parallel program” at a low level, as a collection of sequential processes that frequently synchronize and communicate with each other using some form of message passing.


2008 ◽  
Vol 16 (2-3) ◽  
pp. 167-181 ◽  
Author(s):  
Brian J.N. Wylie ◽  
Markus Geimer ◽  
Felix Wolf

Developers of applications with large-scale computing requirements are currently presented with a variety of high-performance systems optimised for message passing; however, effectively exploiting the available computing resources remains a major challenge. In addition to fundamental application scalability characteristics, application and system peculiarities often manifest only at extreme scales, requiring highly scalable performance measurement and analysis tools that are convenient to incorporate into application development and tuning activities. We present our experiences with a multigrid solver benchmark and state-of-the-art real-world applications for numerical weather prediction and computational fluid dynamics on three quite different multi-thousand-processor supercomputer systems (Cray XT3/4, MareNostrum, and Blue Gene/L), using the newly developed SCALASCA toolset to quantify and isolate a range of significant performance issues.

