COST OPTIMALITY AND PREDICTABILITY OF PARALLEL PROGRAMMING WITH SKELETONS

Skeletons are reusable, parameterized program components with well-defined semantics and pre-packaged efficient parallel implementations. This paper develops a new, provably cost-optimal implementation of the DS (double-scan) skeleton for programming divide-and-conquer algorithms. Our implementation is based on a novel data structure called plist (pointed list); the implementation's performance is estimated using an analytical model. We demonstrate the use of the DS skeleton for parallelizing a tridiagonal system solver and report experimental results for its MPI implementation on a Cray T3E and a Linux cluster. The results confirm the performance improvement achieved by the cost-optimal implementation and show that our performance model predicts its behavior well.
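For orientation, the sequential baseline for a tridiagonal system is the classic Thomas algorithm: one forward elimination sweep followed by one backward substitution sweep, for O(n) operations in total. The sketch below is a plain Python version of that baseline only. It is not the paper's DS skeleton, its plist data structure, or its MPI implementation, and the function name `solve_tridiagonal` and its argument layout are assumptions made for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the sequential Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] the diagonal entry, upper[i] the super-diagonal entry
    (upper[-1] is unused), and rhs[i] the right-hand side.
    No pivoting is done, so the matrix is assumed to be well behaved
    (for example, diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal, left to right.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Backward sweep: substitute, right to left.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Each sweep carries a dependency from one row to the next, which is why parallel versions need a different formulation. A parallel solver is cost-optimal in the usual sense when the product of processor count and parallel running time stays within a constant factor of this O(n) sequential work.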

A Nonlinear Observer for Minimizing Quantization Effects in Virtual Walls

1999. DOI: 10.1115/imece1999-0044

Author(s): Daniel R. Madill, David W. L. Wang, Mennas C. Ching

Keyword(s): Coulomb Friction, Force Sensor, Performance Model, Nonlinear Observer, Direct Drive, Model Based, Measured Position, Three Degree Of Freedom, The Cost, Estimator Design

Quantization effects can reduce the achievable stiffness of virtual walls, and phase delays introduced by filtering are also an issue. This paper presents a nonlinear model-based observer that produces smooth position and velocity estimates with very little delay from measured position and force signals. Compensation for Coulomb friction and motor saturation is incorporated into the estimator. Using the estimator in the implementation of a virtual wall yielded higher wall stiffnesses and better performance. Model-based estimator design was possible because of the design of the manipulator: the three degree-of-freedom manipulator employed is direct-drive, gravity-balanced, and dynamically decoupled, with nearly linear dynamics. The robot structure itself is employed as a force sensor, reducing the cost of the device.

Reducing communication in algebraic multigrid with multi-step node aware communication

The International Journal of High Performance Computing Applications, 2020, Vol 34 (5), pp. 547-561. DOI: 10.1177/1094342020925535

Author(s): Amanda Bienz, William D Gropp, Luke N Olson

Keyword(s): Message Passing, Message Passing Interface, Parallel Implementation, Algebraic Multigrid, Sparse Linear Systems, Parallel Scalability, Strong Scaling, The Cost, Communication Schedule, Inter Process Communication

Algebraic multigrid (AMG) is often viewed as a scalable [Formula: see text] solver for sparse linear systems. Yet AMG lacks parallel scalability because of the increasingly large cost of communication, both in the initial construction of the multigrid hierarchy and in the iterative solve phase. This work introduces a parallel implementation of AMG that reduces the cost of communication, yielding improved parallel scalability. It is common in Message Passing Interface (MPI) programs, particularly in the MPI-everywhere approach, to arrange inter-process communication so that data is sent without regard to the location of the sending and receiving processes. Performance tests show notable differences in the cost of intra-node and inter-node communication, motivating a restructuring of communication. The restructured communication schedule takes advantage of the less costly intra-node communication, reducing both the number and the size of inter-node messages. Node-centric communication extends to a range of components in both the setup and solve phases of AMG, improving the weak and strong scaling of the entire method.
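The effect of node-level aggregation on message counts can be illustrated with a toy model. The sketch below is not the authors' implementation and uses no MPI; it only counts messages, under the assumption (made here for illustration) that ranks are assigned to nodes in contiguous blocks of `ppn` processes. In the standard scheme every process-to-process message between different nodes crosses the network. In the aggregated scheme, all data going from one node to another is bundled into a single inter-node message.

```python
def node_of(rank, ppn):
    """Node index of a rank, assuming contiguous blocks of ppn ranks per node."""
    return rank // ppn


def count_internode_messages(sends, ppn):
    """Compare inter-node message counts for a list of (src, dst) sends.

    Returns (standard, aggregated):
      standard   - each off-node (src, dst) pair is its own network message
      aggregated - one network message per distinct (src_node, dst_node) pair
    """
    standard = 0
    node_pairs = set()
    for src, dst in set(sends):
        src_node, dst_node = node_of(src, ppn), node_of(dst, ppn)
        if src_node != dst_node:
            standard += 1
            node_pairs.add((src_node, dst_node))
    return standard, len(node_pairs)


if __name__ == "__main__":
    # Two nodes with four ranks each; every rank on node 0 sends to
    # every rank on node 1.
    sends = [(s, d) for s in range(4) for d in range(4, 8)]
    print(count_internode_messages(sends, ppn=4))
```

The model ignores the extra intra-node messages needed to gather data before the network send and to redistribute it afterwards. The trade is worthwhile only when intra-node communication is cheaper than inter-node communication, which is the difference the abstract reports.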

Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling

Parallel Processing Letters, 2018, Vol 28 (01), pp. 1850004. DOI: 10.1142/s0129626418500044. Cited by ~1.

Author(s): Weijian Zheng, Fengguang Song, Lan Lin, Zizhong Chen

Keyword(s): Dynamic Scheduling, Parallel Implementation, Optimal Number, Performance Model, Analytical Performance, Distributed Scheduling, Qr Factorization, Runtime Systems, Runtime System, Factorization Algorithms

Implementing parallel software for QR factorizations to achieve scalable performance on massively parallel manycore systems requires a comprehensive design that includes algorithm redesign, efficient runtime systems, synchronization and communication reduction, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices with general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it with a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms depends on an important parameter D (the number of domains), whose best value is not known in advance and currently has to be found by manual tuning and empirical search. To that end, we introduce a simplified analytical performance model to determine an optimal number of domains D[Formula: see text]. The experimental results show that our new parallel implementation is faster than a state-of-the-art multicore-based numerical library by up to 30%, and faster than ScaLAPACK by up to 30 times with thousands of CPU cores. Furthermore, using the new analytical model to predict an optimal number of domains is competitive with exhaustive search, with an average performance difference of 1%.

OLTP PERFORMANCE IMPROVEMENT USING FILE-SYSTEMS LAYER COMPRESSION

Jurnal Teknologi, 2017, Vol 79 (4). DOI: 10.11113/jt.v79.8883

Author(s): Suharjito Suharjito, Adrianus B. Kurnadi

Keyword(s): Response Time, Performance Improvement, Compression Ratio, File Systems, Transaction Processing, Maximum Response, Improve Performance, Oracle Database, Maximum Response Time, The Cost

Databases for Online Transaction Processing (OLTP) applications are used by almost every corporation that has adopted computerisation to support its day-to-day business operations. Compression in the storage or file-system layer has not been widely adopted for OLTP databases because of the concern that it might decrease database performance. OLTP compression in the database layer is available commercially, but it has a significant licence cost that reduces the cost saving of compression. In this research, transparent file-system compression with the LZ4, LZJB and ZLE algorithms was tested as a way to improve the performance of OLTP applications. Using Swing-bench as the benchmark tool and Oracle Database 12c, the results indicated that on an OLTP workload LZJB was the best-performing compression algorithm, with a performance improvement of up to 49% and a consistent reduction of maximum response time and CPU utilisation overhead, while ZLE had the lowest CPU utilisation overhead. LZ4 delivered the highest compression ratio, 5.32, followed by LZJB with 4.92 and ZLE with 1.76. Furthermore, it was found that there is indeed a risk of reduced performance and/or an increase in maximum response time.

A parallel implementation of evolutionary divide and conquer for the TSP

1995. DOI: 10.1049/cp:19951098. Cited by ~1.

Author(s): C.L. Valenzuela

Keyword(s): Parallel Implementation, Divide And Conquer

A Techno-Economic Analysis of a Proposed 1.5 MW Wind Turbine With a Hydrostatic Drive Train

ASME 2009 3rd International Conference on Energy Sustainability, Volume 2, 2009. DOI: 10.1115/es2009-90096

Author(s): James R. Browning, Jon G. McGowan, James F. Manwell

Keyword(s): Wind Turbine, Energy Production, Capital Cost, Performance Model, Drive Train, Cost Models, Initial Capital, Cost Of Energy, The Cost, Utility Scale

Although decreases in the cost of energy from utility scale wind turbine generators have made them competitive with conventional forms of utility power generation, further reductions can increase the presence of wind energy in the global energy mix. The cost of energy from a wind turbine can be reduced by increasing the annual energy production, reducing the initial capital cost of the turbine, or doing both. In this study, the cost of energy is estimated for a theoretical 1.5 MW wind turbine utilizing a continuously variable ratio hydrostatic drive train between the rotor and the generator. The estimated cost of energy is then compared to that of a conventional wind turbine of equivalent rated power. The annual energy production is estimated for the theoretical hydrostatic turbine using an assumed wind speed distribution and a turbine power curve resulting from a steady state performance model of the turbine. The initial capital cost of the turbine is estimated using cost models developed for various components unique to the hydrostatic turbine, as well as economic parameters and models developed by the National Renewable Energy Lab (NREL) for their 2004 WindPACT advanced wind turbine drive train study. The resulting cost of energy, along with various performance characteristics of interest, is presented and compared to that of the WindPACT baseline turbine, which is intended to represent a conventional utility scale wind turbine.

Efficient Operational Space Sensitivity Analysis of Dynamic Multibody Systems

Volume 4: 8th International Conference on Multibody Systems, Nonlinear Dynamics, and Control, Parts A and B, 2011. DOI: 10.1115/detc2011-48884

Author(s): Rudranarayan M. Mukherjee

Keyword(s): Sensitivity Analysis, Data Storage, Multibody Systems, Parallel Implementation, Divide And Conquer, Parametric Perturbation, Divide And Conquer Algorithm, Operational Space, Tree Topologies, Logarithmic Complexity

This paper presents a generalization of the divide and conquer algorithm for sensitivity analysis of dynamic multibody systems based on direct differentiation. While similar sensitivity analysis approaches have been demonstrated for multi-rigid and multi-flexible systems in tree topologies and a limited set of kinematically closed loop topologies, this paper generalizes these approaches to systems in generalized topologies, including many coupled kinematically closed loops. This generalization retains the efficient complexity of the underlying formulations, i.e., linear complexity in serial implementation and logarithmic complexity in parallel implementation. Beyond computational efficiency, the advantages of this method include concurrent sensitivity analysis with forward dynamics, no numerical artifacts arising from parametric perturbation, and significantly reduced data storage compared to traditional methods. An application of this work in control of multibody systems is discussed.

On the Cost-Optimality Trade-off for Service Function Chain Reconfiguration

2019 IEEE 8th International Conference on Cloud Networking (CloudNet), 2019. DOI: 10.1109/cloudnet47604.2019.9064107

Author(s): Kyoomars Alizadeh Noghani, Andreas Kassler, Javid Taheri

Keyword(s): Trade Off, Service Function, Service Function Chain, The Cost, Cost Optimality

Optimal implementation of general divide-and-conquer on the hypercube and related networks

Lecture Notes in Computer Science - Parallel Architectures and Their Efficient Use, 1993, pp. 195-206. DOI: 10.1007/3-540-56731-3_19

Author(s): Ernst W. Mayr, Ralph Werchner

Keyword(s): Divide And Conquer, Optimal Implementation

$\mathcal{HDC}$: A HIGHER-ORDER LANGUAGE FOR DIVIDE-AND-CONQUER

Parallel Processing Letters, 2000, Vol 10 (02n03), pp. 239-250. DOI: 10.1142/s0129626400000238. Cited by ~20.

Author(s): Christoph A. Herrmann, Christian Lengauer

Keyword(s): Parallel Programming, Parallel Implementation, Higher Order, Divide And Conquer, Functional Language, Polynomial Multiplication, Functional Program, Design Decisions, Order Language

We propose the higher-order functional style for the parallel programming of algorithms. The functional language $\mathcal{HDC}$, a subset of the language Haskell, facilitates the clean integration of skeletons into a functional program. Skeletons are predefined programming schemata with an efficient parallel implementation. We report on our compiler, which translates $\mathcal{HDC}$ programs into C+MPI, with emphasis on the design decisions we made. Two small examples, the n queens problem and Karatsuba's polynomial multiplication, are presented to demonstrate the programming comfort and the speedup one can obtain.
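The idea of a skeleton as a higher-order function can be sketched as follows. The sketch is in Python rather than the paper's Haskell subset, it runs sequentially, and the names `divide_and_conquer` and `karatsuba` are assumptions for illustration rather than the paper's API. The skeleton fixes the recursion scheme, and the caller supplies the problem-specific pieces, here for Karatsuba's polynomial multiplication.

```python
def divide_and_conquer(is_basic, solve_basic, divide, combine, problem):
    """Generic divide-and-conquer scheme parameterized by four functions."""
    if is_basic(problem):
        return solve_basic(problem)
    # The subproblems are independent of each other; a parallel
    # implementation of the skeleton could evaluate them concurrently.
    results = [
        divide_and_conquer(is_basic, solve_basic, divide, combine, sub)
        for sub in divide(problem)
    ]
    return combine(problem, results)


def karatsuba(a, b):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    if not a or not b:
        return []
    # Pad both operands to a common power-of-two length.
    n = 1
    while n < max(len(a), len(b)):
        n *= 2
    pa = list(a) + [0] * (n - len(a))
    pb = list(b) + [0] * (n - len(b))

    def divide(problem):
        x, y = problem
        h = len(x) // 2
        x_lo, x_hi, y_lo, y_hi = x[:h], x[h:], y[:h], y[h:]
        x_sum = [p + q for p, q in zip(x_lo, x_hi)]
        y_sum = [p + q for p, q in zip(y_lo, y_hi)]
        # Three half-size products instead of four.
        return [(x_lo, y_lo), (x_hi, y_hi), (x_sum, y_sum)]

    def combine(problem, results):
        low, high, mixed = results
        h = len(problem[0]) // 2
        out = [0] * (4 * h - 1)
        for i in range(len(low)):
            middle = mixed[i] - low[i] - high[i]
            out[i] += low[i]
            out[i + h] += middle
            out[i + 2 * h] += high[i]
        return out

    product = divide_and_conquer(
        lambda p: len(p[0]) == 1,           # basic case: constant polynomials
        lambda p: [p[0][0] * p[1][0]],      # multiply the two constants
        divide,
        combine,
        (pa, pb),
    )
    # Drop the coefficients that only exist because of padding.
    return product[: len(a) + len(b) - 1]
```

For example, `karatsuba([1, 2], [3, 4])` multiplies 1 + 2x by 3 + 4x. The same `divide_and_conquer` function could be reused for other algorithms by passing different argument functions, which is the reuse that skeleton-based programming aims for.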