parallel performance
Recently Published Documents

TOTAL DOCUMENTS: 356 (five years: 43)
H-INDEX: 22 (five years: 2)

Author(s): Evgeny Eremin

The conventional form of Amdahl’s law states that the speedup of calculations on a multiprocessor machine is limited to a fixed constant value, simply because every algorithm contains some non-parallelizable part. This brief paper considers one more general reason that prevents the growth of parallel performance: the processes that implement a distributed task cannot start simultaneously, so every process adds some start-up time, which also reduces the gain from parallel processing. The simple formula proposed here to extend Amdahl’s law leads to a less optimistic picture than the classical result: for a large number of processor units the modified law does not approach a constant but vanishes. This is the result of competition between two factors as the number of parallel processes grows: the computational duty per process decreases while the accumulated start-up time increases. The effect may be subdued by imposing a specific regularity on the launching of parallel processes.
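
The abstract does not reproduce the formula itself, but its qualitative behaviour fixes the general shape. A minimal sketch in Amdahl-style notation, assuming a per-process start-up cost \tau that accumulates because the n processes cannot launch simultaneously (the exact placement of the overhead term is an assumption, not necessarily the paper's expression):

    S(n) = \frac{1}{(1 - p) + p/n + n\tau/T_1}

Here p is the parallelizable fraction and T_1 the single-processor runtime. For \tau = 0 this reduces to classical Amdahl behaviour, with S(n) approaching 1/(1 - p); for any \tau > 0 the n\tau term eventually dominates, so S(n) tends to zero as n grows, matching the vanishing speedup described above.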


2021
Author(s): Jiecheng Zhang, George Moridis, Thomas Blasingame

Abstract: The Reservoir GeoMechanics Simulator (RGMS), a geomechanics simulator based on the finite element method and parallelized using the Message Passing Interface (MPI), is developed in this work to model the stresses and deformations in subsurface systems. RGMS can be used stand-alone or coupled with flow and transport models. pT+H V1.5, a parallel MPI-based version of the serial T+H V1.5 code that describes mass and heat flow in hydrate-bearing porous media, is also developed. Using the fixed-stress split iterative scheme, RGMS is coupled with pT+H V1.5 to investigate the geomechanical responses associated with gas production from hydrate accumulations. The code development and testing process involves evaluation of the parallelization and of the coupling method, as well as verification and validation of the results. The parallel performance of the codes is tested on the Ada Linux cluster of Texas A&M High Performance Research Computing using up to 512 processors, and on a Mac Pro computer with 12 processors. The investigated problems are: Group 1: geomechanical problems solved by RGMS in 2D Cartesian and cylindrical domains and a 3D problem, involving 4x10^6 and 3.375x10^6 elements, respectively; Group 2: realistic problems of gas production from hydrates using pT+H V1.5 in 2D and 3D systems with 2.45x10^5 and 3.6x10^6 elements, respectively; Group 3: the 3D problem in Group 2 solved with the coupled RGMS-pT+H V1.5 simulator, fully accounting for geomechanics. Two domain partitioning options are investigated on the Ada Linux cluster and the Mac Pro, and the parallel performance of the codes is monitored. On the Ada Linux cluster using 512 processors, the simulation speedups (a) of RGMS are 218.89, 188.13, and 284.70 in the Group 1 problems, (b) of pT+H V1.5 are 174.25 and 341.67 in the Group 2 cases, and (c) of the coupled simulators is 331.80 in Group 3. The results produced in this work show (a) the necessity of using full geomechanics simulators in marine hydrate-related studies, because of the pronounced geomechanical effects on production and displacements, and (b) the effectiveness of the parallel simulators developed in this study, which can be the only realistic option for these complex simulations of large multi-dimensional domains.
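
The fixed-stress split mentioned above is a standard sequential-implicit coupling scheme: the flow problem is solved with the volumetric (mean) stress held fixed, then the mechanics problem is solved with the updated pressures, and the pair is iterated to convergence. A minimal Python sketch of one coupled time step, assuming hypothetical flow_solver and mech_solver objects (the attribute and method names are illustrative, not the RGMS or pT+H V1.5 API):

    import numpy as np

    def fixed_stress_step(flow_solver, mech_solver, tol=1e-6, max_iters=50):
        """One coupled time step of a fixed-stress split iteration."""
        p_prev = flow_solver.pressure.copy()
        for iteration in range(max_iters):
            # 1. Solve flow/heat with the volumetric (mean) stress held fixed.
            flow_solver.solve(fixed_mean_stress=mech_solver.mean_stress)
            # 2. Solve geomechanics with the updated pore pressures as loads.
            mech_solver.solve(pore_pressure=flow_solver.pressure)
            # 3. Stop when the coupling iteration no longer changes the pressure field.
            change = np.linalg.norm(flow_solver.pressure - p_prev)
            if change <= tol * max(np.linalg.norm(p_prev), 1.0):
                return iteration + 1  # number of coupling iterations used
            p_prev = flow_solver.pressure.copy()
        raise RuntimeError("fixed-stress coupling did not converge")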


2021, Vol 5 (ICFP), pp. 1-29
Author(s): Chaitanya Koparkar, Mike Rainey, Michael Vollmer, Milind Kulkarni, Ryan R. Newton

Recent work showed that compiling functional programs to use dense, serialized memory representations for recursive algebraic datatypes can yield significant constant-factor speedups for sequential programs. But serializing data in a maximally dense format consequently serializes the processing of that data, yielding a tension between density and parallelism. This paper shows that a disciplined, practical compromise is possible. We present Parallel Gibbon, a compiler that obtains the benefits of dense data formats and parallelism. We formalize the semantics of the parallel location calculus underpinning this novel implementation strategy, and show that it is type-safe. Parallel Gibbon exceeds the parallel performance of existing compilers for purely functional programs that use recursive algebraic datatypes, including, notably, abstract-syntax-tree traversals as in compilers.
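
To make the density/parallelism tension concrete, consider a binary tree serialized in preorder into a single flat buffer, loosely in the spirit of Gibbon's dense representation (a Python sketch for illustration only; Gibbon compiles typed functional programs, and its actual tags, layout, and location calculus differ):

    # Tag bytes for a preorder-serialized binary tree.
    LEAF, NODE = 0, 1

    def serialize(tree, buf):
        """tree is ('leaf', n) or ('node', left, right); append its preorder encoding."""
        if tree[0] == 'leaf':
            buf += [LEAF, tree[1]]
        else:
            buf.append(NODE)
            serialize(tree[1], buf)
            serialize(tree[2], buf)
        return buf

    def sum_tree(buf, i=0):
        """Sum the serialized tree starting at offset i; returns (total, next offset).
        Note the sequential chain: the right subtree's offset is only known after
        the left subtree has been fully traversed -- exactly the tension the paper
        resolves with its parallel location calculus."""
        if buf[i] == LEAF:
            return buf[i + 1], i + 2
        left, j = sum_tree(buf, i + 1)
        right, k = sum_tree(buf, j)
        return left + right, k

    buf = serialize(('node', ('leaf', 1), ('node', ('leaf', 2), ('leaf', 3))), [])
    total, _ = sum_tree(buf)  # total == 6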


2021, Vol 8 (1)
Author(s): N. Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Abstract: This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a given problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (NWeight). A particular runtime pattern emerged when adding more executors to run a job: for some workloads, the runtime grew longer as more executors were added. This phenomenon is predicted by the new parallelisation model. The equation resulting from the model explains certain performance patterns that fit neither Amdahl’s law nor Gustafson’s equation. The results show that the proposed model achieved the best fit for all workloads and most of the data sizes, using the R-squared metric to measure the accuracy of the fit to the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
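
The article develops the exact equation; the sketch below only illustrates the general fitting procedure with an assumed functional form: a fixed serial term, a divisible parallel term, and a per-executor overhead term that can make runtime grow again as executors are added (the form, coefficient names, and data here are illustrative, not the paper's model):

    import numpy as np
    from scipy.optimize import curve_fit

    def runtime_model(e, t_serial, t_par, t_over):
        """Illustrative runtime vs. executor count e: a fixed serial part, a
        divisible parallel part, and overhead growing with e. The overhead
        term is what lets runtime increase once e is large enough."""
        return t_serial + t_par / e + t_over * e

    executors = np.array([1, 2, 4, 8, 16, 32])
    runtimes  = np.array([500.0, 270.0, 160.0, 110.0, 95.0, 105.0])  # made-up data

    (t_s, t_p, t_o), _ = curve_fit(runtime_model, executors, runtimes)
    predicted = runtime_model(executors, t_s, t_p, t_o)
    # R-squared, the goodness-of-fit metric used in the article.
    r_squared = 1 - np.sum((runtimes - predicted) ** 2) / np.sum((runtimes - runtimes.mean()) ** 2)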


2021
Author(s): Kang Hu, Xingyu Liao, You Zou, Jianxin Wang

Transposable elements (TEs) represent quantitatively important components of genome sequences (e.g., 90% of the wheat genome) and play important roles in genome organization and evolution, which makes unsupervised annotation of transposable elements highly significant. Classification, an important step in TE annotation, summarizes information about the type or mechanism of the raw repetitive sequences. RepeatClassifier is a basic homology-based classification tool that compares TE families against both the Repeat Protein Database (DB) and the RepeatMasker libraries. Unfortunately, RepeatClassifier is inefficient and takes several days to classify the repetitive sequences of large genomes. Hence, we propose the Spark-based RepeatClassifier (SRC), which uses a Greedy Algorithm with Dynamic Upper Boundary (GDUB) for data division and load balancing, and Spark to improve the parallelism of RepeatClassifier. Experimental results show that SRC not only maintains the same level of accuracy as RepeatClassifier but also achieves a 42-88x speedup over it. At the same time, SRC shows excellent parallel performance when dealing with input datasets that have unbalanced length distributions.
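
GDUB itself is defined in the paper; the Python sketch below shows the generic greedy, length-aware partitioning idea such schemes build on: take sequences longest-first and always assign them to the currently least-loaded partition (this cap-free policy is a plain illustration, not the dynamic upper boundary of the paper):

    import heapq

    def greedy_partition(seq_lengths, n_partitions):
        """Assign sequences (by length) to partitions so total lengths stay balanced.
        Classic longest-first greedy: each sequence goes to the lightest partition."""
        # Min-heap of (current total length, partition id).
        heap = [(0, p) for p in range(n_partitions)]
        heapq.heapify(heap)
        assignment = {p: [] for p in range(n_partitions)}
        for idx, length in sorted(enumerate(seq_lengths), key=lambda x: -x[1]):
            load, p = heapq.heappop(heap)
            assignment[p].append(idx)
            heapq.heappush(heap, (load + length, p))
        return assignment

    # Skewed lengths (as in repeat families of large genomes) still balance well:
    parts = greedy_partition([900, 40, 35, 800, 30, 25, 700, 20], 3)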


2021
Author(s): Anita Tino

As the multi-core computing era continues to progress, the need to increase single-thread performance and throughput, and to adapt seamlessly to thread-level parallelism (TLP), remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, though often at the expense of single-thread performance as threads per core scale. Accordingly, applications that require higher single-thread performance must often resort to multi-processor systems of single-threaded cores, which incur additional area overhead and power dissipation. In an attempt to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low-overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP is able to increase single-thread Instructions Per Cycle (IPC) by up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency compared to a conventional SMT processor.
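
The throughput-versus-latency compromise described above can be made concrete with a little assumed arithmetic (the numbers below are illustrative, not measurements from this work):

    # A core running one thread at IPC 2.0 might run 2 SMT threads at IPC 1.3 each.
    single_thread_ipc = 2.0
    smt_per_thread_ipc = 1.3
    core_throughput_gain = (2 * smt_per_thread_ipc) / single_thread_ipc  # 1.3x total work
    per_thread_slowdown = smt_per_thread_ipc / single_thread_ipc         # each thread at 0.65x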


2021, Vol 5 (2), pp. 62-77
Author(s): Sesha Kalyur, Nagaraja G.S

Although several automated parallel conversion solutions are available, very few have attempted to provide proper estimates of the available inherent parallelism and the expected parallel speedup. CALIPER, the outcome of this research work, is a parallel performance estimation technology that can fill this void. High-level language structures such as functions, loops, conditionals, etc., which ease program development, can be a hindrance to effective performance analysis. We refer to these program structures as the Program Shape. As a preparatory step, CALIPER attempts to remove these shape-related hindrances, an activity we refer to as Program Shape Flattening. Programs are also characterized by dependences that exist between different instructions and impose an upper limit on the gains from parallel conversion. For parallel estimation, we first group instructions that share dependences into a class we refer to as a Dependence Class or Parallel Class. While the instructions belonging to a class run sequentially, the classes themselves run in parallel. The parallel runtime is then the runtime of the longest-running class. We report performance estimates of parallel conversion as two metrics: the inherent parallelism in the program, reported as Maximum Available Parallelism (MAP), and the speedup after conversion, reported as Speedup After Parallelization (SAP).
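
A minimal sketch of the dependence-class estimate, using union-find to group instructions that share dependences (Python; the unit instruction costs, the pairwise dependence input, and computing the estimate as total work divided by the longest class are assumptions based on the description above):

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    def estimate(costs, dependences):
        """costs[i]: runtime of instruction i; dependences: (i, j) pairs that must
        run sequentially. Instructions sharing dependences form one Dependence
        (Parallel) Class; the classes themselves run in parallel."""
        uf = UnionFind(len(costs))
        for i, j in dependences:
            uf.union(i, j)
        class_time = {}
        for i, c in enumerate(costs):
            root = uf.find(i)
            class_time[root] = class_time.get(root, 0) + c
        serial = sum(costs)
        parallel = max(class_time.values())  # runtime of the longest-running class
        return serial / parallel             # MAP-style parallelism estimate

    # Two independent chains of equal work give an estimated parallelism of 2:
    map_estimate = estimate([3, 3, 3, 3], [(0, 1), (2, 3)])  # -> 2.0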

