Parallel Performance Model for Vertex Repositioning Algorithms and Application to Mesh Partitioning

Author(s):  
D. Benitez ◽  
J. M. Escobar ◽  
R. Montenegro ◽  
E. Rodriguez
2011 ◽  
Vol 19 (1) ◽  
pp. 13-25
Author(s):  
Murat Manguoglu ◽  
Faisal Saied ◽  
Ahmed Sameh ◽  
Ananth Grama

With the availability of large-scale parallel platforms comprising tens of thousands of processors and beyond, there is significant impetus for the development of scalable parallel sparse linear system solvers and preconditioners. An integral part of this design process is the development of performance models capable of predicting performance and providing accurate cost estimates for the solvers and preconditioners. There has been some past work on characterizing the performance of the iterative solvers themselves. In this paper, we investigate the problem of characterizing the performance and scalability of banded preconditioners. Recent work has demonstrated the superior convergence properties and robustness of banded preconditioners compared to the state-of-the-art ILU family of preconditioners as well as algebraic multigrid preconditioners. Furthermore, when used in conjunction with efficient banded solvers, banded preconditioners are capable of significantly faster time-to-solution. Our banded solver, the Truncated Spike algorithm, is specifically designed for parallel performance and tolerance to deep memory hierarchies. Its regular structure is also highly amenable to accurate performance characterization. Using these characteristics, we derive the following results in this paper: (i) we develop parallel formulations of the Truncated Spike solver; (ii) we develop a highly accurate pseudo-analytical parallel performance model for our solver; and (iii) we show the excellent prediction capabilities of our model, on the basis of which we argue for the high scalability of our solver. Our pseudo-analytical performance model is based on an analytical performance characterization of each phase of our solver. These analytical models are then parameterized using actual runtime information on target platforms. An important consequence of our performance models is that they reveal underlying performance bottlenecks in both serial and parallel formulations. All of our results are validated on diverse heterogeneous multiclusters, platforms for which performance prediction is particularly challenging. Finally, we use our model to predict the scalability of the Spike algorithm on up to 65,536 cores. This paper extends results presented at the Ninth International Symposium on Parallel and Distributed Computing.
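As a rough illustration of the phase-based approach described above (not the authors' actual model), the Python sketch below fits a hypothetical pseudo-analytical cost function, with one analytical term per solver phase and constants calibrated from measured runtimes, and then extrapolates it to large core counts. All names, constants, and timings here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical phase-based cost model for a parallel banded solve:
#   T(p) = c_comp * n * b^2 / p  +  c_comm * log2(p)  +  c_serial
# where n is the system size, b the bandwidth, and p the core count.
# The constants are calibrated from measured runtimes, in the spirit
# of the "pseudo-analytical" approach described in the abstract.

def model(p, c_comp, c_comm, c_serial):
    n, b = 1_000_000, 65            # fixed problem size and bandwidth (assumed)
    t_compute = c_comp * n * b**2 / p   # factorization/solve work, partitioned
    t_comm = c_comm * np.log2(p)        # reduction across partitions
    return t_compute + t_comm + c_serial

# Calibrate against (made-up) measured runtimes on a target platform.
cores = np.array([16, 32, 64, 128, 256])
measured = np.array([52.1, 26.8, 14.3, 8.1, 5.2])   # seconds, illustrative
params, _ = curve_fit(model, cores, measured)

# Extrapolate to large core counts to study scalability.
for p in (1024, 16384, 65536):
    print(f"predicted runtime on {p} cores: {model(p, *params):.2f} s")
```

Because each term is analytical, a fit like this exposes where the crossover to communication-dominated execution occurs, which is the kind of bottleneck information the abstract attributes to the model.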


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
N. Ahmed ◽  
Andre L. C. Barczak ◽  
Mohammad A. Rashid ◽  
Teo Susnjak

This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime of generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a given problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data were obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to a job: for some workloads, the runtime grew longer as executors were added. This phenomenon is predicted by the new parallelisation model. The resulting equation explains performance patterns that fit neither Amdahl's law nor Gustafson's equation. The results show that the proposed model achieved the best fit for all workloads and most data sizes, using the R-squared metric to assess the accuracy of the fit to the empirical data. The proposed model has an advantage over machine learning models in its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data, because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
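The article's exact equation is not reproduced here, so the sketch below is only a hypothetical illustration of the fitting procedure it describes: a runtime model with a serial term, a parallel term that shrinks as executors are added, and an overhead term that grows with them. The growing term is what lets such a fit reproduce runtimes that worsen with more executors, which Amdahl's law T(n) = t_s + t_p/n (monotonically decreasing in n) cannot. The model form, workload numbers, and function names are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical runtime model for a fixed problem size:
#   T(n) = a + b/n + c*sqrt(n)
# a: serial fraction; b/n: perfectly parallel work over n executors;
# c*sqrt(n): coordination overhead along the serial boundaries of a
# 2D executor arrangement. (Illustrative form, not the paper's equation.)

def runtime(n, a, b, c):
    return a + b / n + c * np.sqrt(n)

executors = np.array([2, 4, 8, 16, 32, 64])
seconds = np.array([410.0, 225.0, 140.0, 105.0, 98.0, 112.0])  # made-up logs

params, _ = curve_fit(runtime, executors, seconds)
pred = runtime(executors, *params)

# Goodness of fit via R-squared, as in the article.
ss_res = np.sum((seconds - pred) ** 2)
ss_tot = np.sum((seconds - seconds.mean()) ** 2)
print("R^2 =", 1 - ss_res / ss_tot)

# With c > 0, T(n) eventually increases with n, reproducing the
# "more executors, longer runtime" pattern the article reports.
print("best executor count in the tested range:", executors[np.argmin(pred)])
```

A fit of this shape has an interior minimum in the executor count, which neither Amdahl's nor Gustafson's monotone speedup curves can produce; that is the qualitative behaviour the article's model captures.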


2020 ◽  
Author(s):  
Mª de la Cruz Déniz‐Déniz ◽  
Mª Katiuska Cabrera-Suárez ◽  
Josefa D. Martín-Santana

2018 ◽  
Author(s):  
Pavel Pokhilko ◽  
Evgeny Epifanovsky ◽  
Anna I. Krylov

Using a single-precision floating-point representation reduces the size of data and the computation time by a factor of two relative to the double precision conventionally used in electronic structure programs. For large-scale calculations, such as those encountered in many-body theories, the reduced memory footprint alleviates memory and input/output bottlenecks. The reduced data size can lead to additional gains due to improved parallel performance on CPUs and various accelerators. However, using single precision can potentially reduce the accuracy of computed observables. Here we report an implementation of coupled-cluster and equation-of-motion coupled-cluster methods with single and double excitations in single precision. We consider both the standard implementation and one using Cholesky decomposition or resolution-of-the-identity representations of the electron-repulsion integrals. Numerical tests illustrate that when single precision is used in correlated calculations, the loss of accuracy is insignificant, and a pure single-precision implementation can be used for computing energies, analytic gradients, excited states, and molecular properties. In addition to pure single-precision calculations, our implementation allows one to follow a single-precision calculation with clean-up iterations, fully recovering double-precision results while retaining significant savings.
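The single-precision-plus-clean-up strategy resembles mixed-precision iterative refinement. The sketch below is not the authors' coupled-cluster code; it illustrates the general idea on a simple Jacobi iteration: converge in float32 first, then run a few float64 clean-up iterations from the single-precision solution to recover full double-precision accuracy. The system, tolerances, and iteration counts are illustrative assumptions.

```python
import numpy as np

def jacobi(A, b, x0, tol, dtype):
    """Jacobi fixed-point iteration x <- (b - R x) / diag(A) in the given precision."""
    A, b, x = A.astype(dtype), b.astype(dtype), x0.astype(dtype)
    d = np.diag(A)
    R = A - np.diag(d)
    for it in range(10_000):
        x_new = (b - R @ x) / d
        if np.linalg.norm(x_new - x) < tol:
            return x_new, it + 1
        x = x_new
    return x, it + 1

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, so Jacobi converges
b = rng.standard_normal(n)

# Phase 1: cheap single-precision iterations (half the memory traffic).
x32, it32 = jacobi(A, b, np.zeros(n), 1e-6, np.float32)

# Phase 2: a few double-precision "clean-up" iterations started from the
# single-precision solution recover full double-precision accuracy.
x64, it64 = jacobi(A, b, x32.astype(np.float64), 1e-14, np.float64)

print(f"single-precision iterations: {it32}, clean-up iterations: {it64}")
print("final residual:", np.linalg.norm(A @ x64 - b))
```

The point of the split is that most iterations run in the cheap precision, while the short double-precision tail removes the accumulated rounding error, mirroring the savings-plus-recovery behaviour described in the abstract.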


1989 ◽  
Author(s):  
Edward A. Carmona ◽  
Michael D. Rice