Impact study of data locality on task-based applications through the Heteroprio scheduler

2019 ◽  
Vol 5 ◽  
pp. e190 ◽  
Author(s):  
Bérenger Bramas

The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component: it solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler, originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking data locality into account during task distribution. The main principle is to use different task lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut the total memory transfer of an execution by more than half.
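As a rough illustration of the idea of per-memory-node task lists with a locality heuristic, the sketch below pushes each ready task to the list of the memory node that already holds the largest share of its data. This is a minimal sketch under our own assumptions, not the actual Heteroprio implementation; all names (Task, pick_memory_node, block_location) are hypothetical.

```python
# Hypothetical sketch of a locality-affinity heuristic in the spirit of the
# paper: one ready-task list per memory node, tasks pushed to the node that
# already holds most of their data. Names and structure are illustrative only.
from collections import defaultdict

class Task:
    def __init__(self, name, data_blocks):
        self.name = name
        self.data_blocks = data_blocks  # block id -> size in bytes

def pick_memory_node(task, block_location):
    """Return the memory node holding the largest share of the task's data."""
    bytes_on_node = defaultdict(int)
    for block, size in task.data_blocks.items():
        bytes_on_node[block_location.get(block, 0)] += size
    return max(bytes_on_node, key=bytes_on_node.get)

# One task list per memory node (e.g., node 0 = CPU RAM, nodes 1..n = GPUs).
task_lists = defaultdict(list)
block_location = {"A": 1, "B": 1, "C": 0}  # which memory node holds each block

t = Task("gemm_0", {"A": 4096, "B": 4096, "C": 4096})
task_lists[pick_memory_node(t, block_location)].append(t)
print({node: [x.name for x in lst] for node, lst in task_lists.items()})
```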


Proceedings ◽  
2020 ◽  
Vol 49 (1) ◽  
pp. 43
Author(s):  
Alanna Weisberg ◽  
Julie Le Gall ◽  
Pro Stergiou ◽  
Larry Katz

Maximal ball velocity is a significant performance indicator in many sports, such as baseball. Doppler radar guns are widely assumed to underestimate velocity, because a radar gun measures only the component of motion along its line of sight: accuracy increases as the angle between the radar gun and the object's trajectory decreases (the cosine effect). The purpose of this study was to investigate the impact of player handedness and radar gun location on the accuracy of ball velocity measurements. Throws were analyzed in four conditions: the radar gun on the right side, throwing with the right arm, then with the left arm; and the radar gun on the left side, throwing with the right arm, then with the left arm. Cronbach's alpha for all four conditions showed α-values above 0.97; however, a paired t-test indicated significant differences between the 3D motion analysis and the radar gun. Bland–Altman plots show a high degree of scatter in all conditions. The results suggest that radar gun measurements can be highly inconsistent compared to 3D motion analysis.
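For reference, the cosine effect mentioned above has a simple form: the gun reports the true speed scaled by the cosine of the angle between its line of sight and the ball's trajectory, v_measured = v · cos(θ). A minimal illustration:

```python
# Cosine effect of a Doppler radar gun: the measured speed is the true
# speed projected onto the gun's line of sight, v_measured = v * cos(theta).
import math

def radar_reading(true_speed_ms, angle_deg):
    """Speed reported by the gun when offset by angle_deg from the trajectory."""
    return true_speed_ms * math.cos(math.radians(angle_deg))

for angle in (0, 5, 10, 20):
    print(f"{angle:2d} deg -> {radar_reading(40.0, angle):.2f} m/s "
          f"(true 40.00 m/s)")
```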


2005 ◽  
Vol 15 (04) ◽  
pp. 423-438
Author(s):  
RENATO P. ISHII ◽  
RODRIGO F. DE MELLO ◽  
LUCIANO J. SENGER ◽  
MARCOS J. SANTANA ◽  
REGINA H. C. SANTANA ◽  
...  

This paper presents a new model for evaluating the impact of the processing operations that result from communication among processes. The model quantifies the traffic volume imposed on the communication network by means of latency and overhead parameters. These parameters represent the load that each process imposes on the network and the resulting delay on the CPU caused by network operations. This delay is represented in the model by the slowdown metric. Equations that quantify the costs involved in processing and message exchange are defined, and equations for the maximum network bandwidth are used in scheduling decisions. The proposed model uses a constant that bounds the maximum allowed usage of the communication network; this constant selects between two possible scheduling techniques: group scheduling or scheduling across the communication network. These techniques are incorporated into the DPWP policy, generating an extension of that policy. Experimental and simulation results confirm the performance enhancement of parallel applications supervised by the extended DPWP policy, compared to executions supervised by the original DPWP.
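The threshold-based choice between the two techniques can be sketched as follows. The constant, the cost terms, and the function names are placeholders for the quantities the paper defines, not its actual equations.

```python
# Illustrative sketch of the threshold-based choice between the two
# scheduling techniques described above. The cost model here is a
# placeholder; the paper defines its own latency/overhead equations.
MAX_NETWORK_USAGE = 0.7  # hypothetical cap on allowed network utilization

def predicted_network_usage(traffic_bytes_per_s, bandwidth_bytes_per_s):
    return traffic_bytes_per_s / bandwidth_bytes_per_s

def choose_technique(traffic, bandwidth):
    """Group processes on one node if the network would be overloaded,
    otherwise schedule them across the communication network."""
    if predicted_network_usage(traffic, bandwidth) > MAX_NETWORK_USAGE:
        return "group scheduling"
    return "network scheduling"

print(choose_technique(traffic=900e6, bandwidth=1e9))  # group scheduling
print(choose_technique(traffic=100e6, bandwidth=1e9))  # network scheduling
```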


Author(s):  
Stefan Lemvig Glimberg ◽  
Allan Peter Engsig-Karup ◽  
Luke N Olson

The focus of this article is on the parallel scalability of a distributed multigrid framework, known as the DTU Compute GPUlab Library, for execution on graphics processing unit (GPU)-accelerated supercomputers. We demonstrate near-ideal weak scalability for a high-order fully nonlinear potential flow (FNPF) time domain model on the Oak Ridge Titan supercomputer, which is equipped with a large number of many-core CPU-GPU nodes. The high-order finite difference scheme for the solver is implemented to expose data locality and scalability, and the linear Laplace solver is based on an iterative multilevel preconditioned defect correction method designed for high-throughput processing and massive parallelism. In this work, the FNPF discretization is based on a multi-block discretization that allows for large-scale simulations. In this setup, each grid block is based on a logically structured mesh with support for curvilinear representation of horizontal block boundaries, allowing an accurate representation of geometric features such as surface-piercing bottom-mounted structures, for example, the mono-pile foundations demonstrated here. Unprecedented performance and scalability results are presented for a system of equations that has historically been considered too expensive to solve in practical applications. A novel feature of the potential flow model is demonstrated: a modest number of multigrid restrictions is sufficient for fast convergence, improving overall parallel scalability as the coarse-grid problem diminishes in size. In the numerical benchmarks presented, we demonstrate the use of 8192 modern Nvidia GPUs, enabling large-scale and high-resolution nonlinear marine hydrodynamics applications.
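The preconditioned defect correction iteration at the core of the Laplace solver follows the standard pattern x ← x + M⁻¹(b − Ax). A minimal sketch, with a Jacobi preconditioner standing in for the multigrid cycle used in the actual solver:

```python
# Minimal sketch of an iterative preconditioned defect correction method:
# x <- x + M^{-1} (b - A x). Here a Jacobi preconditioner (diagonal of A)
# stands in for the multilevel/multigrid cycle used in the actual solver.
import numpy as np

def defect_correction(A, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    M_inv = 1.0 / np.diag(A)          # Jacobi stand-in for multigrid
    for _ in range(max_iter):
        r = b - A @ x                  # defect
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        x = x + M_inv * r              # preconditioned correction
    return x

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
print(defect_correction(A, b), np.linalg.solve(A, b))
```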


Electronics ◽  
2019 ◽  
Vol 8 (9) ◽  
pp. 982 ◽  
Author(s):  
Alberto Cascajo ◽  
David E. Singh ◽  
Jesus Carretero

This work presents an HPC framework that provides new strategies for resource management and job scheduling, based on executing different applications on shared compute nodes to maximize platform utilization. The framework includes a scalable monitoring tool that analyzes the platform's compute node utilization. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms, that uses the information provided by the monitor in combination with application-level analysis to detect performance degradation in the running applications. This degradation, caused by applications sharing compute nodes and competing for their resources, is avoided by means of dynamic application migration. A description of the architecture, as well as a practical evaluation of the proposal, shows significant performance improvements of up to 20% in makespan and 10% in energy consumption compared to a non-optimized execution.
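A hedged sketch of the monitor-and-migrate loop described above; the threshold, metric names, and migrate call are illustrative placeholders, not the CLARISSE interface.

```python
# Illustrative control loop: detect performance degradation on shared
# nodes and trigger application migration. All names are hypothetical;
# this is not the CLARISSE interface.
DEGRADATION_THRESHOLD = 0.8  # observed/expected progress ratio

def degraded(app_metrics):
    return app_metrics["progress_ratio"] < DEGRADATION_THRESHOLD

def schedule_step(apps, free_nodes, migrate):
    for app in apps:
        if degraded(app["metrics"]) and free_nodes:
            migrate(app["name"], free_nodes.pop())

apps = [{"name": "cfd_sim", "metrics": {"progress_ratio": 0.6}},
        {"name": "fft_job", "metrics": {"progress_ratio": 0.95}}]
schedule_step(apps, free_nodes=["node-12"],
              migrate=lambda name, node: print(f"migrating {name} -> {node}"))
```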


Author(s):  
James D Stevens ◽  
Andreas Klöckner

The ability to model, analyze, and predict execution time of computations is an important building block that supports numerous efforts, such as load balancing, benchmarking, job scheduling, developer-guided performance optimization, and the automation of performance tuning for high-performance parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs, which are increasingly prevalent in the world's fastest supercomputers. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. With this approach, we empower the user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define their own model and customize the calibration process, making it as simple or complex, and as application-targeted or general, as desired. As application examples of our approach, we demonstrate both linear and nonlinear models; these examples are designed to predict execution times for multiple variants of a particular computation: two matrix-matrix multiplication variants, four discontinuous Galerkin differentiation operation variants, and two 2D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this highly user-customizable approach as a response to a central question arising in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use? We believe this last attribute precludes approaches that require manual collection of kernel or hardware statistics.
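The linear variant of such a model can be written as t ≈ Σᵢ wᵢcᵢ, where the cᵢ are automatically gathered operation counts and the weights wᵢ are calibrated against benchmark timings. A minimal least-squares sketch under that assumption, with synthetic counts and timings rather than the authors' tooling:

```python
# Minimal sketch of calibrating a linear execution-time model
# t ~ sum_i w_i * c_i from benchmark kernels via least squares.
# The counts and timings below are synthetic placeholders.
import numpy as np

# Rows: benchmark kernels; columns: operation counts (flops, bytes, barriers).
counts = np.array([[1e9, 4e8, 1e3],
                   [2e9, 1e8, 5e2],
                   [5e8, 8e8, 2e3],
                   [3e9, 2e8, 1e2]])
measured_times = np.array([0.021, 0.024, 0.019, 0.031])  # seconds

weights, *_ = np.linalg.lstsq(counts, measured_times, rcond=None)

def predict_time(kernel_counts):
    """Predict execution time for a kernel from its operation counts."""
    return kernel_counts @ weights

print(predict_time(np.array([1.5e9, 5e8, 1.2e3])))
```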


2018 ◽  
Author(s):  
Huan Truong

A number of problems in the bioinformatics, systems biology, and computational biology fields require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from the advancement of computing power in the serial processing capabilities of individual CPU cores. However, this growth has slowed in recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complements to or replacements for traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of available computational power and sequencing data, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge needed to arrive at an answer. I have developed tools to perform analyses at scales that are traditionally unattainable on general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used it to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with this computational power. I have also developed a method to accelerate pairwise k-mer comparison on the GPU, and I used it to further develop PolyHomology, a framework that scaffolds shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that this approach to heterogeneous computing can help answer questions in biology and is a viable path to new discoveries, both now and in the future.
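As a toy CPU-side illustration of the pairwise k-mer comparison step that the thesis accelerates on the GPU, two sequences can be scored by the overlap of their k-mer sets:

```python
# Toy CPU version of pairwise k-mer comparison: similarity of two
# sequences as the Jaccard index of their k-mer sets. The GPU method in
# the thesis parallelizes this kind of comparison at much larger scale.
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(a, b, k=4):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

print(kmer_jaccard("ACGTACGTGACG", "ACGTACGTTACG"))
```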


Author(s):  
Prof. Vanita Babanne ◽  
Amol Kajale ◽  
Gaurav Menaria ◽  
Manish Kamble ◽  
Pranav Mundada

Irrigation forms one of the mainstays of agriculture and food production. As a result of outdated strategies in developed and developing countries, much water is wasted in this process. In this article, we establish a regulatory model of irrigation management to check this waste of water by providing a good irrigation system for farming. The prototype Smart Automatic Irrigation Controller (SAIC) has two operating units, viz. a Wireless Sensor Unit and a Wireless Information Processing Unit. The purpose of the sensor unit is to measure climate and soil conditions and to calculate the actual water loss due to evapotranspiration. The processing unit takes this calculation and performs the regulatory actions required to deliver the right amount of water to the farm. A combination of basic rules is included in the decision-making table. The model was initially developed and validated through effectiveness testing. The results obtained showed the potential to compensate for water loss by almost 100%. The controller achieved a 27% reduction in water use and a 40% increase in crop yields. The prototype is connected to a cloud server for data storage and remote control access. The device is efficient, inexpensive, and easy for end users to operate. The model is new and unique in the sense that it can plan irrigation for all crop types, in all climatic conditions and all soil types, while feeding the right combination of soil type and growth stage to the inference engine.
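A hedged sketch of the kind of decision rule such a controller might apply: estimate the crop's water loss from evapotranspiration, then command only the deficit. The coefficients and thresholds below are placeholders, not the SAIC decision table.

```python
# Illustrative irrigation decision rule: irrigate only the water deficit
# estimated from evapotranspiration and the soil-moisture reading.
# Coefficients and thresholds are placeholders, not the SAIC rule table.
def water_deficit_mm(et0_mm, crop_coeff, rainfall_mm):
    """Crop water need (mm) not covered by rainfall."""
    return max(0.0, et0_mm * crop_coeff - rainfall_mm)

def irrigation_command(soil_moisture_pct, et0_mm, crop_coeff, rainfall_mm,
                       moisture_threshold_pct=60.0):
    if soil_moisture_pct >= moisture_threshold_pct:
        return 0.0  # soil already wet enough; no irrigation
    return water_deficit_mm(et0_mm, crop_coeff, rainfall_mm)

print(irrigation_command(soil_moisture_pct=45.0, et0_mm=5.2,
                         crop_coeff=1.1, rainfall_mm=1.0))
```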

