parallel performance
Recently Published Documents

TOTAL DOCUMENTS: 356 (five years: 43)
H-INDEX: 22 (five years: 2)

Author(s): Evgeny Eremin

The conventional form of Amdahl’s law states that the speedup of calculations on a multiprocessor machine is limited to a fixed constant value, simply because every algorithm contains some non-parallelizable part. This brief paper considers one more general reason that prevents the growth of parallel performance: the processes that implement a distributed task cannot start simultaneously, so every process adds some start-up time, which also reduces the gain from parallel processing. The simple formula proposed here to extend Amdahl’s law leads to a less optimistic picture than the classical result: for a large number of processor units the modified law does not approach a constant but vanishes. This is the result of competition between two factors as the number of parallel processes grows: the computational duty per process decreases while the accumulated start-up time increases. The effect may be subdued by imposing a specific regularity on the launching of parallel processes.
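
The abstract does not reproduce the formula itself, but its qualitative behaviour fixes the general shape. A minimal sketch in Amdahl-style notation, assuming a per-process start-up cost \tau that accumulates because the n processes cannot launch simultaneously (the exact placement of the overhead term is an assumption, not necessarily the paper's expression):

    S(n) = \frac{1}{(1 - p) + p/n + n\tau/T_1}

Here p is the parallelizable fraction and T_1 the single-processor runtime. For \tau = 0 this reduces to classical Amdahl behaviour, with S(n) approaching 1/(1 - p); for any \tau > 0 the n\tau term eventually dominates, so S(n) tends to zero as n grows, matching the vanishing speedup described above.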


2021
Author(s): Jiecheng Zhang, George Moridis, Thomas Blasingame

Abstract: The Reservoir GeoMechanics Simulator (RGMS), a geomechanics simulator based on the finite element method and parallelized using the Message Passing Interface (MPI), is developed in this work to model the stresses and deformations in subsurface systems. RGMS can be used stand-alone or coupled with flow and transport models. pT+H V1.5, a parallel MPI-based version of the serial T+H V1.5 code that describes mass and heat flow in hydrate-bearing porous media, is also developed. Using the fixed-stress split iterative scheme, RGMS is coupled with pT+H V1.5 to investigate the geomechanical responses associated with gas production from hydrate accumulations. The code development and testing process involves evaluation of the parallelization and of the coupling method, as well as verification and validation of the results. The parallel performance of the codes is tested on the Ada Linux cluster of Texas A&M High Performance Research Computing using up to 512 processors, and on a Mac Pro computer with 12 processors. The investigated problems are: Group 1: geomechanical problems solved by RGMS in 2D Cartesian and cylindrical domains and a 3D problem, involving 4x10^6 and 3.375x10^6 elements, respectively; Group 2: realistic problems of gas production from hydrates using pT+H V1.5 in 2D and 3D systems with 2.45x10^5 and 3.6x10^6 elements, respectively; Group 3: the 3D problem in Group 2 solved with the coupled RGMS-pT+H V1.5 simulator, fully accounting for geomechanics. Two domain partitioning options are investigated on the Ada Linux cluster and the Mac Pro, and the parallel performance of the codes is monitored. On the Ada Linux cluster using 512 processors, the simulation speedups (a) of RGMS are 218.89, 188.13, and 284.70 in the Group 1 problems, (b) of pT+H V1.5 are 174.25 and 341.67 in the Group 2 cases, and (c) of the coupled simulators is 331.80 in Group 3. The results produced in this work show (a) the necessity of using full geomechanics simulators in marine hydrate-related studies, because of the pronounced geomechanical effects on production and displacements, and (b) the effectiveness of the parallel simulators developed in this study, which can be the only realistic option for these complex simulations of large multi-dimensional domains.
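
The fixed-stress split mentioned above is a standard sequential-implicit coupling scheme: the flow problem is solved with the volumetric (mean) stress held fixed, then the mechanics problem is solved with the updated pressures, and the pair is iterated to convergence. A minimal Python sketch of one coupled time step, assuming hypothetical flow_solver and mech_solver objects (the attribute and method names are illustrative, not the RGMS or pT+H V1.5 API):

    import numpy as np

    def fixed_stress_step(flow_solver, mech_solver, tol=1e-6, max_iters=50):
        """One coupled time step of a fixed-stress split iteration."""
        p_prev = flow_solver.pressure.copy()
        for iteration in range(max_iters):
            # 1. Solve flow/heat with the volumetric (mean) stress held fixed.
            flow_solver.solve(fixed_mean_stress=mech_solver.mean_stress)
            # 2. Solve geomechanics with the updated pore pressures as loads.
            mech_solver.solve(pore_pressure=flow_solver.pressure)
            # 3. Stop when the coupling iteration no longer changes the pressure field.
            change = np.linalg.norm(flow_solver.pressure - p_prev)
            if change <= tol * max(np.linalg.norm(p_prev), 1.0):
                return iteration + 1  # number of coupling iterations used
            p_prev = flow_solver.pressure.copy()
        raise RuntimeError("fixed-stress coupling did not converge")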


2021, Vol 5 (ICFP), pp. 1-29
Author(s): Chaitanya Koparkar, Mike Rainey, Michael Vollmer, Milind Kulkarni, Ryan R. Newton

Recent work showed that compiling functional programs to use dense, serialized memory representations for recursive algebraic datatypes can yield significant constant-factor speedups for sequential programs. But serializing data in a maximally dense format consequently serializes the processing of that data, yielding a tension between density and parallelism. This paper shows that a disciplined, practical compromise is possible. We present Parallel Gibbon, a compiler that obtains the benefits of dense data formats and parallelism. We formalize the semantics of the parallel location calculus underpinning this novel implementation strategy, and show that it is type-safe. Parallel Gibbon exceeds the parallel performance of existing compilers for purely functional programs that use recursive algebraic datatypes, including, notably, abstract-syntax-tree traversals as in compilers.
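
To make the density/parallelism tension concrete, consider a binary tree serialized in preorder into a single flat buffer, loosely in the spirit of Gibbon's dense representation (a Python sketch for illustration only; Gibbon compiles typed functional programs, and its actual tags, layout, and location calculus differ):

    # Tag bytes for a preorder-serialized binary tree.
    LEAF, NODE = 0, 1

    def serialize(tree, buf):
        """tree is ('leaf', n) or ('node', left, right); append its preorder encoding."""
        if tree[0] == 'leaf':
            buf += [LEAF, tree[1]]
        else:
            buf.append(NODE)
            serialize(tree[1], buf)
            serialize(tree[2], buf)
        return buf

    def sum_tree(buf, i=0):
        """Sum the serialized tree starting at offset i; returns (total, next offset).
        Note the sequential chain: the right subtree's offset is only known after
        the left subtree has been fully traversed -- exactly the tension the paper
        resolves with its parallel location calculus."""
        if buf[i] == LEAF:
            return buf[i + 1], i + 2
        left, j = sum_tree(buf, i + 1)
        right, k = sum_tree(buf, j)
        return left + right, k

    buf = serialize(('node', ('leaf', 1), ('node', ('leaf', 2), ('leaf', 3))), [])
    total, _ = sum_tree(buf)  # total == 6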


2021, Vol 8 (1)
Author(s): N. Ahmed, Andre L. C. Barczak, Mohammad A. Rashid, Teo Susnjak

Abstract: This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a given problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (NWeight). A particular runtime pattern emerged when adding more executors to run a job: for some workloads, the runtime grew longer as more executors were added. This phenomenon is predicted by the new parallelisation model. The equation resulting from the model explains certain performance patterns that fit neither Amdahl’s law nor Gustafson’s equation. The results show that the proposed model achieved the best fit for all workloads and most of the data sizes, using the R-squared metric to measure the accuracy of the fit to the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
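
The article develops the exact equation; the sketch below only illustrates the general fitting procedure with an assumed functional form: a fixed serial term, a divisible parallel term, and a per-executor overhead term that can make runtime grow again as executors are added (the form, coefficient names, and data here are illustrative, not the paper's model):

    import numpy as np
    from scipy.optimize import curve_fit

    def runtime_model(e, t_serial, t_par, t_over):
        """Illustrative runtime vs. executor count e: a fixed serial part, a
        divisible parallel part, and overhead growing with e. The overhead
        term is what lets runtime increase once e is large enough."""
        return t_serial + t_par / e + t_over * e

    executors = np.array([1, 2, 4, 8, 16, 32])
    runtimes  = np.array([500.0, 270.0, 160.0, 110.0, 95.0, 105.0])  # made-up data

    (t_s, t_p, t_o), _ = curve_fit(runtime_model, executors, runtimes)
    predicted = runtime_model(executors, t_s, t_p, t_o)
    # R-squared, the goodness-of-fit metric used in the article.
    r_squared = 1 - np.sum((runtimes - predicted) ** 2) / np.sum((runtimes - runtimes.mean()) ** 2)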


2021
Author(s): Kang Hu, Xingyu Liao, You Zou, Jianxin Wang

Transposable elements (TEs) represent quantitatively important components of genome sequences (e.g., 90% of the wheat genome) and play important roles in genome organization and evolution, which makes unsupervised annotation of transposable elements highly significant. Classification, an important step in TE annotation, summarizes information about the type or mechanism of the raw repetitive sequences. RepeatClassifier is a basic homology-based classification tool that compares TE families against both the Repeat Protein Database (DB) and the RepeatMasker libraries. Unfortunately, RepeatClassifier is inefficient and takes several days to classify the repetitive sequences of large genomes. Hence, we propose the Spark-based RepeatClassifier (SRC), which uses a Greedy Algorithm with Dynamic Upper Boundary (GDUB) for data division and load balancing, and Spark to improve the parallelism of RepeatClassifier. Experimental results show that SRC not only maintains the same level of accuracy as RepeatClassifier but also achieves a 42-88x speedup over it. At the same time, SRC shows excellent parallel performance when dealing with input datasets that have unbalanced length distributions.
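
GDUB itself is defined in the paper; the Python sketch below shows the generic greedy, length-aware partitioning idea such schemes build on: take sequences longest-first and always assign them to the currently least-loaded partition (this cap-free policy is a plain illustration, not the dynamic upper boundary of the paper):

    import heapq

    def greedy_partition(seq_lengths, n_partitions):
        """Assign sequences (by length) to partitions so total lengths stay balanced.
        Classic longest-first greedy: each sequence goes to the lightest partition."""
        # Min-heap of (current total length, partition id).
        heap = [(0, p) for p in range(n_partitions)]
        heapq.heapify(heap)
        assignment = {p: [] for p in range(n_partitions)}
        for idx, length in sorted(enumerate(seq_lengths), key=lambda x: -x[1]):
            load, p = heapq.heappop(heap)
            assignment[p].append(idx)
            heapq.heappush(heap, (load + length, p))
        return assignment

    # Skewed lengths (as in repeat families of large genomes) still balance well:
    parts = greedy_partition([900, 40, 35, 800, 30, 25, 700, 20], 3)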


2021
Author(s): Anita Tino

As the multi-core computing era continues to progress, the need to increase single-thread performance and throughput, and to adapt seamlessly to thread-level parallelism (TLP), remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, though often at the expense of single-thread performance as threads per core scale. Accordingly, applications that require higher single-thread performance must often resort to multi-processor systems of single-threaded cores, which incur additional area overhead and power dissipation. In an attempt to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low-overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP is able to increase single-thread Instructions Per Cycle (IPC) by up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency compared to a conventional SMT processor.
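
The throughput-versus-latency compromise described above can be made concrete with a little assumed arithmetic (the numbers below are illustrative, not measurements from this work):

    # A core running one thread at IPC 2.0 might run 2 SMT threads at IPC 1.3 each.
    single_thread_ipc = 2.0
    smt_per_thread_ipc = 1.3
    core_throughput_gain = (2 * smt_per_thread_ipc) / single_thread_ipc  # 1.3x total work
    per_thread_slowdown = smt_per_thread_ipc / single_thread_ipc         # each thread at 0.65x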


2021, Vol 5 (2), pp. 62-77
Author(s): Sesha Kalyur, Nagaraja G.S

Although several automated parallel conversion solutions are available, very few have attempted to provide proper estimates of the available inherent parallelism and the expected parallel speedup. CALIPER, the outcome of this research work, is a parallel performance estimation technology that can fill this void. High-level language structures such as functions, loops, conditionals, etc., which ease program development, can be a hindrance to effective performance analysis. We refer to these program structures as the Program Shape. As a preparatory step, CALIPER attempts to remove these shape-related hindrances, an activity we refer to as Program Shape Flattening. Programs are also characterized by dependences that exist between different instructions and impose an upper limit on the gains from parallel conversion. For parallel estimation, we first group instructions that share dependences into a class we refer to as a Dependence Class or Parallel Class. While the instructions belonging to a class run sequentially, the classes themselves run in parallel. The parallel runtime is then the runtime of the longest-running class. We report performance estimates of parallel conversion as two metrics: the inherent parallelism in the program, reported as Maximum Available Parallelism (MAP), and the speedup after conversion, reported as Speedup After Parallelization (SAP).
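
A minimal sketch of the dependence-class estimate, using union-find to group instructions that share dependences (Python; the unit instruction costs, the pairwise dependence input, and computing the estimate as total work divided by the longest class are assumptions based on the description above):

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))
        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x
        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    def estimate(costs, dependences):
        """costs[i]: runtime of instruction i; dependences: (i, j) pairs that must
        run sequentially. Instructions sharing dependences form one Dependence
        (Parallel) Class; the classes themselves run in parallel."""
        uf = UnionFind(len(costs))
        for i, j in dependences:
            uf.union(i, j)
        class_time = {}
        for i, c in enumerate(costs):
            root = uf.find(i)
            class_time[root] = class_time.get(root, 0) + c
        serial = sum(costs)
        parallel = max(class_time.values())  # runtime of the longest-running class
        return serial / parallel             # MAP-style parallelism estimate

    # Two independent chains of equal work give an estimated parallelism of 2:
    map_estimate = estimate([3, 3, 3, 3], [(0, 1), (2, 3)])  # -> 2.0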

