Structural Outlooks for the OTIS-Arrangement Network

2011 ◽  
Vol 3 (2) ◽  
pp. 59-68 ◽  
Author(s):  
Ahmad Awwad ◽  
Jehad Al-Sadi ◽  
Bassam Haddad ◽  
Ahmad Kayed

Recent studies have revealed that Optical Transpose Interconnection Systems (OTIS) are promising candidates for future high-performance parallel computers. This paper presents and evaluates a general method for algorithm development on the OTIS-Arrangement network (OTIS-AN) as an example of an OTIS network. The proposed method can be reused and customized for any other OTIS network. Furthermore, it allows efficient mapping of a wide class of algorithms onto the OTIS-AN. The method is based on grids and pipelines, two popular structures that underpin a vast body of parallel applications, including linear algebra, divide-and-conquer algorithms, sorting, and FFT computation. This study confirms the viability of the OTIS-AN as an attractive alternative for large-scale parallel architectures.
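For readers new to the OTIS construction, the sketch below (plain C, with a simple ring standing in for the arrangement-graph factor network used in the paper) illustrates the two kinds of moves available to a node <g, p>: an electronic move within its group along a factor-graph edge, and the optical transpose move to node <p, g>. The ring factor graph and all names here are illustrative assumptions, not the authors' construction.

```c
#include <stdio.h>

/* A node of an OTIS network built over an N-node factor graph is a pair
 * <group g, processor p>, with 0 <= g, p < N.  Electronic links stay inside
 * a group and follow factor-graph edges; the single optical link of <g, p>
 * leads to the "transposed" node <p, g>. */
typedef struct { int group; int proc; } otis_node;

/* Optical (transpose) move: <g, p> -> <p, g>. */
static otis_node optical_move(otis_node u) {
    otis_node v = { u.proc, u.group };
    return v;
}

/* Electronic move: stay in the same group, step to a neighbouring processor
 * of the factor network.  For illustration the factor graph is a ring; the
 * OTIS-AN of the paper uses the arrangement graph A(n,k) instead. */
static otis_node electronic_move(otis_node u, int step, int n) {
    otis_node v = { u.group, (u.proc + step + n) % n };
    return v;
}

int main(void) {
    int n = 6;                           /* illustrative factor-graph size */
    otis_node u = { 2, 5 };
    otis_node via_optics = optical_move(u);
    otis_node via_wire   = electronic_move(u, +1, n);
    printf("<%d,%d> --optical--> <%d,%d>\n", u.group, u.proc,
           via_optics.group, via_optics.proc);
    printf("<%d,%d> --electronic--> <%d,%d>\n", u.group, u.proc,
           via_wire.group, via_wire.proc);
    return 0;
}
```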


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
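As a hedged illustration of the trade-off idea only (not the authors' statistical or machine learning models), the sketch below applies a plain Pareto-dominance filter to predicted (runtime, energy) pairs for a handful of made-up candidate configurations, keeping those that no other candidate beats on both objectives.

```c
#include <stdio.h>

/* One candidate configuration with its predicted runtime and energy.
 * The names and numbers are made up purely for illustration. */
typedef struct { const char *name; double runtime_s; double energy_j; } config;

/* A configuration is Pareto-optimal if no other configuration is at least as
 * good on both objectives and strictly better on one. */
static int dominated(const config *c, const config *all, int n) {
    for (int i = 0; i < n; i++) {
        const config *o = &all[i];
        if (o == c) continue;
        if (o->runtime_s <= c->runtime_s && o->energy_j <= c->energy_j &&
            (o->runtime_s < c->runtime_s || o->energy_j < c->energy_j))
            return 1;
    }
    return 0;
}

int main(void) {
    config cand[] = {
        { "16 threads, 2.4 GHz", 100.0, 5000.0 },
        { "16 threads, 1.8 GHz", 112.0, 4200.0 },  /* slower but cheaper */
        { "8 threads, 2.4 GHz",  150.0, 4800.0 },  /* dominated          */
        { "32 threads, 1.8 GHz",  95.0, 5400.0 },
    };
    int n = (int)(sizeof cand / sizeof cand[0]);
    for (int i = 0; i < n; i++)
        if (!dominated(&cand[i], cand, n))
            printf("Pareto-optimal: %s (%.0f s, %.0f J)\n",
                   cand[i].name, cand[i].runtime_s, cand[i].energy_j);
    return 0;
}
```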


1999 ◽  
Vol 09 (02) ◽  
pp. 243-252 ◽  
Author(s):  
O. LARSSON ◽  
M. FEIG ◽  
L. JOHNSSON

We demonstrate good metacomputing efficiency and portability for three typical large-scale parallel applications: one molecular dynamics code and two electromagnetics codes. The codes were developed for distributed-memory parallel platforms using Fortran77 or Fortran90 with MPI. The performance measurements were made on a testbed of two IBM SPs connected through the vBNS. No changes to the application codes were required for correct execution on the testbed, which used the Globus Toolkit for the required metacomputing services. However, we observe that for good performance, MPI codes may need to overlap computation and communication. For such MPI codes, a communications library designed for hierarchical or clustered communication can yield very good metacomputing efficiencies when high-performance networks, such as the vBNS or Abilene, are used for platform connectivity. We demonstrate this by inserting a thin layer between the MPI application and the MPI libraries, providing some clustering of communications between platforms.
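One standard way to insert such a thin layer is the MPI profiling interface (PMPI), which lets a library intercept MPI calls without modifying application code. The sketch below only illustrates that interception mechanism; it simply counts sends before forwarding them and is not the clustering library described in the paper.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal interposition layer using the MPI profiling interface: the linker
 * resolves the application's MPI_Send to this wrapper, which could reorder or
 * aggregate messages but here merely counts them before handing them to the
 * real implementation via PMPI_Send. */
static long intercepted_sends = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    intercepted_sends++;                 /* bookkeeping done by the layer */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends passed through the layer\n",
           rank, intercepted_sends);
    return PMPI_Finalize();
}
```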


2003 ◽  
Vol 13 (01) ◽  
pp. 53-64 ◽  
Author(s):  
ERIC GAMESS

In this paper, we address the goal of transparently executing Java parallel applications on a group of nodes of a Beowulf cluster chosen by a metacomputing system oriented toward efficient execution of Java bytecode, with support for scientific computing. To this end, we extend the Java virtual machine with a message passing interface and quick access to distributed high-performance resources. We also enable the execution of parallel linear algebra methods on large objects from sequential Java applications by invoking SPLAM, our parallel linear algebra package.


2006 ◽  
Vol 16 (03) ◽  
pp. 323-334
Author(s):  
IGOR ROZMAN ◽  
MARJAN ŠTERK ◽  
ROMAN TROBEC

High performance parallel computers provide the computational rates necessary for computer simulations and compute-intensive applications. An important part of a parallel computer program is the MPI software library, which implements communication within parallel applications. Several MPI implementations exist; the most widely used among them are LAM/MPI and MPICH. This paper presents results of four basic synthetic tests and two real simulations in the LAM/MPI and MPICH environments. Tests were made on a computer cluster composed of 17 dual-processor nodes connected by a toroidal mesh. Results show that on the investigated cluster LAM outperformed MPICH, especially for bidirectional ring communication, and that appropriate tuning of communication parameters contributes significantly to the final parallel performance.
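For reference, the bidirectional ring pattern used in such tests can be written with MPI_Sendrecv so that every rank exchanges a message with both neighbours without deadlock. This is a generic sketch of the communication pattern, not the authors' benchmark code.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank sends to its right neighbour while receiving from its left
 * neighbour, then the reverse, forming one bidirectional ring exchange. */
int main(int argc, char **argv)
{
    int rank, size, msg, from_left = -1, from_right = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    msg = rank;

    /* send right / receive from left, then send left / receive from right */
    MPI_Sendrecv(&msg, 1, MPI_INT, right, 0,
                 &from_left, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&msg, 1, MPI_INT, left, 1,
                 &from_right, 1, MPI_INT, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from left and %d from right\n",
           rank, from_left, from_right);
    MPI_Finalize();
    return 0;
}
```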


2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms, including multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities. In addition to the standard single and double precision types, the standard also includes half and quadruple precision. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
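To make the batching idea concrete, here is a minimal sketch of one uniformly sized group of small matrix multiplications handled by a single call. The routine name dgemm_batch_uniform and the plain triple-loop kernel are illustrative assumptions, not part of the standard described in the article; a real Batched BLAS routine adds transpose options, alpha/beta scaling, leading dimensions, and optimised kernels.

```c
#include <stdio.h>

/* Illustrative "batched" multiply for one uniformly sized group:
 * C[i] = A[i] * B[i] for each of the batch_count m-by-k / k-by-n problems,
 * stored contiguously in row-major order. */
static void dgemm_batch_uniform(int m, int n, int k, int batch_count,
                                const double *A, const double *B, double *C)
{
    for (int b = 0; b < batch_count; b++) {
        const double *Ab = A + (long)b * m * k;
        const double *Bb = B + (long)b * k * n;
        double       *Cb = C + (long)b * m * n;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int p = 0; p < k; p++)
                    s += Ab[i * k + p] * Bb[p * n + j];
                Cb[i * n + j] = s;
            }
    }
}

int main(void) {
    enum { M = 2, N = 2, K = 2, BATCH = 3 };
    double A[BATCH * M * K], B[BATCH * K * N], C[BATCH * M * N];
    for (int i = 0; i < BATCH * M * K; i++) A[i] = i + 1;        /* dummy data */
    for (int i = 0; i < BATCH * K * N; i++) B[i] = (i % 2) ? 1.0 : 0.5;
    dgemm_batch_uniform(M, N, K, BATCH, A, B, C);
    printf("first entry of first result matrix = %g\n", C[0]);
    return 0;
}
```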


2008 ◽  
Vol 16 (2-3) ◽  
pp. 105-121 ◽  
Author(s):  
Martin Schulz ◽  
Jim Galarowicz ◽  
Don Maghrak ◽  
William Hachfeld ◽  
David Montoya ◽  
...  

Over the last decades, a large number of performance tools have been developed to analyze and optimize high performance applications. Their acceptance by end users, however, has been slow: each tool alone is often limited in scope and comes with widely varying interfaces and workflow constraints, requiring different changes in the often complex build and execution infrastructure of the target application. We started the Open|SpeedShop project about 3 years ago to overcome these limitations and provide efficient, easy-to-apply, and integrated performance analysis for parallel systems. Open|SpeedShop has two different faces: it provides an interoperable tool set covering the most common analysis steps as well as a comprehensive plugin infrastructure for building new tools. In both cases, the tools can be deployed to large-scale parallel applications using DPCL/Dyninst for distributed binary instrumentation. Further, all tools developed within or on top of Open|SpeedShop are accessible through multiple fully equivalent interfaces, including an easy-to-use GUI as well as an interactive command line interface, reducing the usage threshold for those tools.


2012 ◽  
Vol 2012 ◽  
pp. 1-18 ◽  
Author(s):  
Xiaocheng Liu ◽  
Bin Chen ◽  
Xiaogang Qiu ◽  
Ying Cai ◽  
Kedi Huang

An increasing number of high performance computing parallel applications leverage the power of the cloud for parallel processing. How to schedule these parallel applications to improve the quality of service is the key to successfully hosting them in the cloud. The large scale of the cloud makes parallel job scheduling more complicated, as even the simple parallel job scheduling problem is NP-complete. In this paper, we propose a parallel job scheduling algorithm named MEASY. MEASY adopts migration and consolidation to enhance the most popular EASY scheduling algorithm. Our extensive experiments on well-known workloads show that the algorithm takes very good care of the quality of service. For two common parallel job scheduling objectives, it achieves up to 41.1% (23.1% on average) improvement in average response time and up to 82.9% (69.3% on average) improvement in average slowdown. The algorithm is also robust in that it tolerates inaccurate CPU usage estimation and high migration cost. Our approach involves only trivial modification of EASY and requires no additional technique; it is practical and effective in the cloud environment.
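For context, EASY is a backfilling scheduler: a waiting job may jump ahead of the queue head only if doing so cannot delay the head job's reservation. The sketch below shows that admission test in isolation, with invented field names and numbers; the migration and consolidation steps that MEASY adds are not modelled.

```c
#include <stdio.h>

/* One waiting job: processors requested and user-estimated runtime. */
typedef struct { int procs; double est_runtime_s; } job;

/* EASY backfilling admission test.  Inputs (normally derived from the set of
 * running jobs): free_now      - processors idle right now,
 *                shadow_time_s - earliest time the queue-head job can start,
 *                extra_procs   - processors still free at that start time.
 * A waiting job may be backfilled only if it fits in the idle processors and
 * cannot delay the head job's reservation. */
static int can_backfill(const job *j, double now_s, int free_now,
                        double shadow_time_s, int extra_procs)
{
    if (j->procs > free_now)
        return 0;                            /* does not fit at all         */
    if (now_s + j->est_runtime_s <= shadow_time_s)
        return 1;                            /* finishes before the shadow  */
    return j->procs <= extra_procs;          /* or never needs the head
                                                job's reserved processors   */
}

int main(void) {
    job small_short = { 4, 600.0 };          /* 4 procs, 10 min (made up)   */
    job small_long  = { 4, 7200.0 };         /* 4 procs, 2 h                */
    double now = 0.0, shadow = 1800.0;       /* head job starts in 30 min   */
    int free_now = 8, extra = 2;
    printf("short job backfills: %d\n",
           can_backfill(&small_short, now, free_now, shadow, extra));
    printf("long job backfills:  %d\n",
           can_backfill(&small_long, now, free_now, shadow, extra));
    return 0;
}
```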


Author(s):  
E. A. Ashcroft ◽  
A. A. Faustini ◽  
R. Jaggannathan ◽  
W. W. Wadge

In Chapter 1, we saw how Lucid could be used to express solutions to standard problems such as sorting and matrix multiplication. One of the unique characteristics of Lucid is that it can be used not only as a programming language but also as a “composition” language. That is, instead of using Lucid to specify computations, it can be used to express how computation components (expressed in some other language) can be “glued” together to form a coherent application. By doing so, the resulting application can enjoy some of the practical benefits attributable to Lucid, such as high performance through exploitation of implicit parallelism and robustness through software fault tolerance. In this chapter, we discuss one such use of Lucid: as part of a hybrid language for constructing parallel applications to be executed on conventional parallel computers. A conventional parallel computer consists either of a number of processors, each with local memory, interconnected by a network (as in distributed-memory architectures), or of a number of processors that share memory, possibly through an interconnection network (as in shared-memory architectures). The past decade has seen the advent of conventional parallel computers, starting with the Denelcor HEP, evolving to the CM-2 and Intel Hypercube, and further evolving to the CM-5, Intel Paragon, Cray T3D, and IBM SP-2. Even networks of workstations (or workstation clusters) are seen as low-cost (“poor man’s”) parallel computers. Programming of conventional parallel computers has proven to be far more challenging than had been expected. Part of the reason is the continued use of low-level, explicitly parallel programming models such as PVM [42] and Linda [10]. Two factors have fueled the continuing use of such languages despite their limited success:
1. The need to reuse existing sequential code, because the cost of rewriting legacy applications from scratch is considered prohibitive in both economic and technical terms.
2. The need to run on conventional parallel computers that view a “parallel program” at a low level, as a collection of sequential processes that frequently synchronize and communicate with each other using some form of message passing.


2008 ◽  
Vol 16 (2-3) ◽  
pp. 167-181 ◽  
Author(s):  
Brian J.N. Wylie ◽  
Markus Geimer ◽  
Felix Wolf

Developers of applications with large-scale computing requirements are currently presented with a variety of high-performance systems optimised for message passing; however, effectively exploiting the available computing resources remains a major challenge. In addition to fundamental application scalability characteristics, application and system peculiarities often manifest only at extreme scales, requiring highly scalable performance measurement and analysis tools that are convenient to incorporate into application development and tuning activities. We present our experiences with a multigrid solver benchmark and state-of-the-art real-world applications for numerical weather prediction and computational fluid dynamics on three quite different multi-thousand-processor supercomputer systems (Cray XT3/4, MareNostrum, and Blue Gene/L), using the newly developed SCALASCA toolset to quantify and isolate a range of significant performance issues.

