COST OPTIMALITY AND PREDICTABILITY OF PARALLEL PROGRAMMING WITH SKELETONS

2003 ◽  
Vol 13 (04) ◽  
pp. 575-587 ◽  
Author(s):  
HOLGER BISCHOF ◽  
SERGEI GORLATCH ◽  
EMANUEL KITZELMANN

Skeletons are reusable, parameterized program components with well-defined semantics and pre-packaged efficient parallel implementation. This paper develops a new, provably cost-optimal implementation of the DS (double-scan) skeleton for programming divide-and-conquer algorithms. Our implementation is based on a novel data structure called plist (pointed list); the implementation's performance is estimated using an analytical model. We demonstrate the use of the DS skeleton for parallelizing a tridiagonal system solver and report experimental results for its MPI implementation on a Cray T3E and a Linux cluster: they confirm the performance improvement achieved by the cost-optimal implementation and demonstrate that our performance model predicts its behavior well.
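
For reference, a tridiagonal system of the kind the paper parallelizes can be solved sequentially with the classic Thomas algorithm: a forward elimination sweep followed by a backward substitution sweep. The sketch below is only that sequential baseline, not the paper's DS skeleton or plist implementation; the function name and the argument layout are assumptions made for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] is the sub-diagonal entry of row i (lower[0] is unused),
    diag[i] the diagonal entry, upper[i] the super-diagonal entry
    (upper[-1] is unused). No pivoting is done, so this sketch assumes
    a well-conditioned (e.g. diagonally dominant) matrix.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward sweep (left to right): eliminate the sub-diagonal.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot

    # Backward sweep (right to left): substitute the known unknowns.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The two sweeps traverse the data in opposite directions, which is the pattern the name double-scan suggests, although the abstract itself does not spell out the skeleton's definition.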

1999 ◽  
Author(s):  
Daniel R. Madill ◽  
David W. L. Wang ◽  
Mennas C. Ching

Quantization effects can reduce the achievable stiffness of virtual walls. Phase delays introduced by filtering are also an issue. This paper presents a non-linear model-based observer that produces smooth position and velocity estimates with very little delay based on measured position and force signals. Compensation for Coulomb friction and motor saturation is incorporated into the estimator. Use of the estimator in the implementation of a virtual wall yielded higher wall stiffnesses and better performance. Model-based estimator design was possible due to the design of the manipulator. The three degree-of-freedom manipulator employed is direct-drive, gravity-balanced, and dynamically-decoupled with nearly linear dynamics. The robot structure itself is employed as a force sensor, reducing the cost of the device.


Author(s):  
Amanda Bienz ◽  
William D Gropp ◽  
Luke N Olson

Algebraic multigrid (AMG) is often viewed as a scalable [Formula: see text] solver for sparse linear systems. Yet, AMG lacks parallel scalability due to increasingly large costs associated with communication, both in the initial construction of a multigrid hierarchy and in the iterative solve phase. This work introduces a parallel implementation of AMG that reduces the cost of communication, yielding improved parallel scalability. It is common in Message Passing Interface (MPI), particularly in the MPI-everywhere approach, to arrange inter-process communication, so that communication is transported regardless of the location of the send and receive processes. Performance tests show notable differences in the cost of intra- and internode communication, motivating a restructuring of communication. In this case, the communication schedule takes advantage of the less costly intra-node communication, reducing both the number and the size of internode messages. Node-centric communication extends to the range of components in both the setup and solve phase of AMG, yielding an increase in the weak and strong scaling of the entire method.
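
The effect of node-centric communication on message counts can be illustrated with a small, MPI-free model. The sketch below is not the authors' implementation: the function name, the message tuples, and the rank-to-node map are hypothetical. It only compares how many inter-node messages a rank-to-rank schedule sends with how many remain when all traffic between a pair of nodes is combined into one message. It does not model the removal of duplicate data, which is how message size could also shrink.

```python
from collections import defaultdict

def aggregate_by_node(messages, node_of):
    """Compare standard and node-aggregated inter-node message counts.

    messages: list of (src_rank, dst_rank, size_in_bytes) tuples.
    node_of:  dict mapping each rank to the node it runs on.
    Returns (standard_count, aggregated), where aggregated maps a
    (src_node, dst_node) pair to the total bytes sent between them.
    """
    standard_count = 0
    aggregated = defaultdict(int)
    for src, dst, size in messages:
        src_node, dst_node = node_of[src], node_of[dst]
        if src_node == dst_node:
            continue  # intra-node traffic stays on the cheaper path
        standard_count += 1
        aggregated[(src_node, dst_node)] += size
    return standard_count, dict(aggregated)

# Hypothetical layout: ranks 0-1 on node 0, ranks 2-3 on node 1.
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
messages = [(0, 2, 8), (0, 3, 8), (1, 2, 8), (0, 1, 8)]
standard, combined = aggregate_by_node(messages, node_of)
print(standard, len(combined))
```

In this toy layout, the number of aggregated inter-node messages is `len(combined)`, one per communicating node pair, while `standard` counts one per communicating rank pair.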


2018 ◽  
Vol 28 (01) ◽  
pp. 1850004 ◽  
Author(s):  
Weijian Zheng ◽  
Fengguang Song ◽  
Lan Lin ◽  
Zizhong Chen

Implementing parallel software for QR factorizations to achieve scalable performance on massively parallel manycore systems requires a comprehensive design that includes algorithm redesign, efficient runtime systems, synchronization and communication reduction, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices with general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it with a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms uses an important parameter D (i.e., the number of domains), whose best value is not known in advance and must be found by manual tuning and empirical search. To that end, we introduce a simplified analytical performance model to determine an optimal number of domains D[Formula: see text]. The experimental results show that our new parallel implementation is faster than a state-of-the-art multicore-based numerical library by up to 30%, and faster than ScaLAPACK by up to 30 times with thousands of CPU cores. Furthermore, using the new analytical model to predict an optimal number of domains is competitive with exhaustive searching, with an average performance difference of 1%.


2017 ◽  
Vol 79 (4) ◽  
Author(s):  
Suharjito Suharjito ◽  
Adrianus B. Kurnadi

Databases for Online Transaction Processing (OLTP) applications are used by almost every corporation that has adopted computerisation to support its day-to-day business operations. Compression in the storage or file-system layer has not been widely adopted for OLTP databases because of the concern that it might decrease database performance. OLTP compression in the database layer is available commercially, but it has a significant licence cost that reduces the cost saving of compression. In this research, transparent file-system compression with the LZ4, LZJB and ZLE algorithms was tested to improve the performance of an OLTP application. Using Swing-bench as the benchmark tool and Oracle database 12c, the results indicated that on an OLTP workload, LZJB was the best-performing compression algorithm, with a performance improvement of up to 49% and a consistent reduction of maximum response time and CPU utilisation overhead, while LZ4 had the highest compression ratio and ZLE had the lowest CPU utilisation overhead. In terms of compression ratio, LZ4 delivered the highest, at 5.32, followed by LZJB at 4.92 and ZLE at 1.76. Furthermore, it is found that there is indeed a risk of reduced performance and/or an increase of maximum response time.
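
To put the reported compression ratios in storage terms, the short calculation below converts each ratio into the fraction of space saved. It assumes the ratio is defined as uncompressed size divided by compressed size, which the abstract does not state explicitly; the function name is hypothetical.

```python
def space_saving(compression_ratio):
    """Fraction of storage saved, assuming ratio = uncompressed / compressed."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / compression_ratio

# Compression ratios reported in the abstract.
reported_ratios = {"LZ4": 5.32, "LZJB": 4.92, "ZLE": 1.76}

for name, ratio in reported_ratios.items():
    print(f"{name}: ratio {ratio:.2f} -> {space_saving(ratio):.1%} saved")
```

Under that assumption, the gap between LZ4 and LZJB is small in terms of space saved even though the ratios differ, because the saving grows only as 1 - 1/ratio.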


Author(s):  
James R. Browning ◽  
Jon G. McGowan ◽  
James F. Manwell

Although decreases in the cost of energy from utility scale wind turbine generators have made them competitive with conventional forms of utility power generation, further reductions can increase the presence of wind energy in the global energy mix. The cost of energy from a wind turbine can be reduced by increasing the annual energy production, reducing the initial capital cost of the turbine, or doing both. In this study, the cost of energy is estimated for a theoretical 1.5 MW wind turbine utilizing a continuously variable ratio hydrostatic drive train between the rotor and the generator. The estimated cost of energy is then compared to that of a conventional wind turbine of equivalent rated power. The annual energy production is estimated for the theoretical hydrostatic turbine using an assumed wind speed distribution and a turbine power curve resulting from a steady state performance model of the turbine. The initial capital cost of the turbine is estimated using cost models developed for various components unique to the hydrostatic turbine as well as economic parameters and models developed by the National Renewable Energy Lab (NREL) for their 2004 WindPACT advanced wind turbine drive train study. The resulting cost of energy, along with various performance characteristics of interest, is presented and compared to that of the WindPACT baseline turbine intended to represent a conventional utility scale wind turbine.
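
The estimation chain described above (wind speed distribution and power curve give annual energy production, which together with capital cost gives cost of energy) can be sketched as follows. Everything numeric here is a hypothetical placeholder rather than data from the study: the Rayleigh distribution, the cubic power curve with its cut-in, rated and cut-out speeds, and the simplified cost formula that lumps all recurring costs into one annual figure.

```python
import math

def rayleigh_pdf(v, mean_speed):
    """Rayleigh wind speed density for a given mean wind speed (m/s)."""
    k = math.pi / (4.0 * mean_speed ** 2)
    return 2.0 * k * v * math.exp(-k * v ** 2)

def power_kw(v, rated_kw=1500.0, cut_in=3.0, rated_speed=12.0, cut_out=25.0):
    """Hypothetical power curve: cubic ramp from cut-in to rated speed."""
    if v < cut_in or v > cut_out:
        return 0.0
    if v >= rated_speed:
        return rated_kw
    return rated_kw * (v ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)

def annual_energy_kwh(mean_speed, dv=0.1, v_max=30.0):
    """Integrate power curve times wind speed density (midpoint rule)."""
    hours_per_year = 8760.0
    steps = int(round(v_max / dv))
    expected_kw = 0.0
    for k in range(steps):
        v = (k + 0.5) * dv
        expected_kw += power_kw(v) * rayleigh_pdf(v, mean_speed) * dv
    return hours_per_year * expected_kw

def cost_of_energy(initial_capital_cost, fixed_charge_rate,
                   annual_operating_cost, aep_kwh):
    """Simplified cost of energy: annualized capital plus operating cost per kWh."""
    return (fixed_charge_rate * initial_capital_cost
            + annual_operating_cost) / aep_kwh
```

The sketch makes the two levers named in the abstract visible: raising `aep_kwh` or lowering `initial_capital_cost` both reduce the value returned by `cost_of_energy`.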


Author(s):  
Rudranarayan M. Mukherjee

This paper presents a generalization of the divide and conquer algorithm for sensitivity analysis of dynamic multibody systems based on direct differentiation. While a similar sensitivity analysis approach has been demonstrated for multi-rigid and multi-flexible systems in tree topologies and a limited set of kinematically closed loop topologies, this paper presents the generalization of these approaches to systems in generalized topologies including many coupled kinematically closed loops. This generalization retains the efficient complexity of the underlying formulations, i.e., linear and logarithmic complexity in serial and parallel implementations, respectively. Beyond computational efficiency, the advantages of this method include concurrent sensitivity analysis with forward dynamics, no numerical artifacts arising from parametric perturbation, and significantly reduced data storage compared to traditional methods. An interesting application of this work in control of multibody systems is discussed.


2000 ◽  
Vol 10 (02n03) ◽  
pp. 239-250 ◽  
Author(s):  
CHRISTOPH A. HERRMANN ◽  
CHRISTIAN LENGAUER

We propose the higher-order functional style for the parallel programming of algorithms. The functional language [Formula: see text], a subset of the language Haskell, facilitates the clean integration of skeletons into a functional program. Skeletons are predefined programming schemata with an efficient parallel implementation. We report on our compiler, which translates [Formula: see text] programs into C+MPI, especially on the design decisions we made. Two small examples, the n queens problem and Karatsuba's polynomial multiplication, are presented to demonstrate the programming comfort and the speedup one can obtain.
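
Karatsuba's polynomial multiplication, one of the two examples mentioned, is a divide-and-conquer algorithm that replaces four half-size products with three. The sketch below is a plain sequential Python version for illustration only; it is not the authors' functional skeleton code, and the helper names are hypothetical.

```python
def poly_add(p, q):
    """Add two coefficient lists (lowest degree first)."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(n)]

def poly_sub(p, q):
    """Subtract coefficient list q from p (lowest degree first)."""
    return poly_add(p, [-c for c in q])

def karatsuba(p, q):
    """Multiply two polynomials given as coefficient lists."""
    if not p or not q:
        return []
    if len(p) == 1:
        return [p[0] * c for c in q]
    if len(q) == 1:
        return [q[0] * c for c in p]

    # Divide: split both polynomials at the same degree.
    half = max(len(p), len(q)) // 2
    p_lo, p_hi = p[:half], p[half:]
    q_lo, q_hi = q[:half], q[half:]

    # Conquer: three independent recursive products instead of four.
    low = karatsuba(p_lo, q_lo)
    high = karatsuba(p_hi, q_hi)
    cross = karatsuba(poly_add(p_lo, p_hi), poly_add(q_lo, q_hi))
    mid = poly_sub(poly_sub(cross, low), high)

    # Combine: low + mid * x^half + high * x^(2*half).
    result = [0] * (len(p) + len(q) - 1)
    for i, c in enumerate(low):
        result[i] += c
    for i, c in enumerate(mid):
        result[i + half] += c
    for i, c in enumerate(high):
        result[i + 2 * half] += c
    return result
```

The three recursive calls do not depend on one another, which is the kind of independence a divide-and-conquer skeleton could evaluate in parallel.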

