Impact study of data locality on task-based applications through the Heteroprio scheduler

2019 ◽  
Vol 5 ◽  
pp. e190 ◽  
Author(s):  
Bérenger Bramas

The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component: it solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler, originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking data locality into account during task distribution. The main principle is to use different task lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut the total memory transfer of an execution by more than half.
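As a rough illustration of the idea of per-memory-node task lists with a locality heuristic, the sketch below pushes each ready task to the list of the memory node that already holds the largest share of its data. This is a minimal sketch under our own assumptions, not the actual Heteroprio implementation; all names (Task, pick_memory_node, block_location) are hypothetical.

```python
# Hypothetical sketch of a locality-affinity heuristic in the spirit of the
# paper: one ready-task list per memory node, tasks pushed to the node that
# already holds most of their data. Names and structure are illustrative only.
from collections import defaultdict

class Task:
    def __init__(self, name, data_blocks):
        self.name = name
        self.data_blocks = data_blocks  # block id -> size in bytes

def pick_memory_node(task, block_location):
    """Return the memory node holding the largest share of the task's data."""
    bytes_on_node = defaultdict(int)
    for block, size in task.data_blocks.items():
        bytes_on_node[block_location.get(block, 0)] += size
    return max(bytes_on_node, key=bytes_on_node.get)

# One task list per memory node (e.g., node 0 = CPU RAM, nodes 1..n = GPUs).
task_lists = defaultdict(list)
block_location = {"A": 1, "B": 1, "C": 0}  # which memory node holds each block

t = Task("gemm_0", {"A": 4096, "B": 4096, "C": 4096})
task_lists[pick_memory_node(t, block_location)].append(t)
print({node: [x.name for x in lst] for node, lst in task_lists.items()})
```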


Proceedings ◽  
2020 ◽  
Vol 49 (1) ◽  
pp. 43
Author(s):  
Alanna Weisberg ◽  
Julie Le Gall ◽  
Pro Stergiou ◽  
Larry Katz

Maximal ball velocity is a significant performance indicator in many sports, such as baseball. Doppler radar guns are widely assumed to underestimate velocity, because a radar gun measures only the component of motion along its line of sight: accuracy increases as the angle between the radar gun and the object's trajectory decreases (the cosine effect). The purpose of this study was to investigate the impact of player handedness and radar gun location on the accuracy of ball velocity measurements. Throws were analyzed in four conditions: the radar gun on the right side, throwing with the right arm, then with the left arm; and the radar gun on the left side, throwing with the right arm, then with the left arm. Cronbach's alpha for all four conditions showed α-values above 0.97; however, a paired t-test indicated significant differences between the 3D motion analysis and the radar gun. Bland–Altman plots show a high degree of scatter in all conditions. The results suggest that radar gun measurements can be highly inconsistent compared to 3D motion analysis.
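For reference, the cosine effect mentioned above has a simple form: the gun reports the true speed scaled by the cosine of the angle between its line of sight and the ball's trajectory, v_measured = v · cos(θ). A minimal illustration:

```python
# Cosine effect of a Doppler radar gun: the measured speed is the true
# speed projected onto the gun's line of sight, v_measured = v * cos(theta).
import math

def radar_reading(true_speed_ms, angle_deg):
    """Speed reported by the gun when offset by angle_deg from the trajectory."""
    return true_speed_ms * math.cos(math.radians(angle_deg))

for angle in (0, 5, 10, 20):
    print(f"{angle:2d} deg -> {radar_reading(40.0, angle):.2f} m/s "
          f"(true 40.00 m/s)")
```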


2005 ◽  
Vol 15 (04) ◽  
pp. 423-438
Author(s):  
RENATO P. ISHII ◽  
RODRIGO F. DE MELLO ◽  
LUCIANO J. SENGER ◽  
MARCOS J. SANTANA ◽  
REGINA H. C. SANTANA ◽  
...  

This paper presents a new model for evaluating the impact of the processing operations that result from communication among processes. The model quantifies the traffic volume imposed on the communication network by means of latency and overhead parameters. These parameters represent the load that each process imposes on the network and the resulting delay on the CPU caused by network operations. This delay is represented in the model by the slowdown metric. Equations that quantify the costs involved in processing and message exchange are defined, and equations for the maximum network bandwidth are used in scheduling decisions. The proposed model uses a constant that bounds the maximum allowed usage of the communication network; this constant selects between two possible scheduling techniques: group scheduling or scheduling across the communication network. These techniques are incorporated into the DPWP policy, generating an extension of that policy. Experimental and simulation results confirm the performance enhancement of parallel applications supervised by the extended DPWP policy, compared to executions supervised by the original DPWP.
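The threshold-based choice between the two techniques can be sketched as follows. The constant, the cost terms, and the function names are placeholders for the quantities the paper defines, not its actual equations.

```python
# Illustrative sketch of the threshold-based choice between the two
# scheduling techniques described above. The cost model here is a
# placeholder; the paper defines its own latency/overhead equations.
MAX_NETWORK_USAGE = 0.7  # hypothetical cap on allowed network utilization

def predicted_network_usage(traffic_bytes_per_s, bandwidth_bytes_per_s):
    return traffic_bytes_per_s / bandwidth_bytes_per_s

def choose_technique(traffic, bandwidth):
    """Group processes on one node if the network would be overloaded,
    otherwise schedule them across the communication network."""
    if predicted_network_usage(traffic, bandwidth) > MAX_NETWORK_USAGE:
        return "group scheduling"
    return "network scheduling"

print(choose_technique(traffic=900e6, bandwidth=1e9))  # group scheduling
print(choose_technique(traffic=100e6, bandwidth=1e9))  # network scheduling
```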


Author(s):  
Stefan Lemvig Glimberg ◽  
Allan Peter Engsig-Karup ◽  
Luke N Olson

The focus of this article is on the parallel scalability of a distributed multigrid framework, known as the DTU Compute GPUlab Library, for execution on graphics processing unit (GPU)-accelerated supercomputers. We demonstrate near-ideal weak scalability for a high-order fully nonlinear potential flow (FNPF) time domain model on the Oak Ridge Titan supercomputer, which is equipped with a large number of many-core CPU-GPU nodes. The high-order finite difference scheme for the solver is implemented to expose data locality and scalability, and the linear Laplace solver is based on an iterative multilevel preconditioned defect correction method designed for high-throughput processing and massive parallelism. In this work, the FNPF discretization is based on a multi-block discretization that allows for large-scale simulations. In this setup, each grid block is based on a logically structured mesh with support for curvilinear representation of horizontal block boundaries, allowing an accurate representation of geometric features such as surface-piercing bottom-mounted structures, for example, the mono-pile foundations demonstrated here. Unprecedented performance and scalability results are presented for a system of equations that has historically been considered too expensive to solve in practical applications. A novel feature of the potential flow model is demonstrated: a modest number of multigrid restrictions is sufficient for fast convergence, improving overall parallel scalability as the coarse-grid problem diminishes in size. In the numerical benchmarks presented, we demonstrate the use of 8192 modern Nvidia GPUs, enabling large-scale and high-resolution nonlinear marine hydrodynamics applications.
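The preconditioned defect correction iteration at the core of the Laplace solver follows the standard pattern x ← x + M⁻¹(b − Ax). A minimal sketch, with a Jacobi preconditioner standing in for the multigrid cycle used in the actual solver:

```python
# Minimal sketch of an iterative preconditioned defect correction method:
# x <- x + M^{-1} (b - A x). Here a Jacobi preconditioner (diagonal of A)
# stands in for the multilevel/multigrid cycle used in the actual solver.
import numpy as np

def defect_correction(A, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    M_inv = 1.0 / np.diag(A)          # Jacobi stand-in for multigrid
    for _ in range(max_iter):
        r = b - A @ x                  # defect
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        x = x + M_inv * r              # preconditioned correction
    return x

A = np.array([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
print(defect_correction(A, b), np.linalg.solve(A, b))
```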


Electronics ◽  
2019 ◽  
Vol 8 (9) ◽  
pp. 982 ◽  
Author(s):  
Alberto Cascajo ◽  
David E. Singh ◽  
Jesus Carretero

This work presents an HPC framework that provides new strategies for resource management and job scheduling, based on executing different applications on shared compute nodes to maximize platform utilization. The framework includes a scalable monitoring tool that analyzes the platform's compute node utilization. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms, that uses the information provided by the monitor in combination with application-level analysis to detect performance degradation in the running applications. This degradation, caused by applications sharing compute nodes and competing for their resources, is avoided by means of dynamic application migration. A description of the architecture, as well as a practical evaluation of the proposal, shows significant performance improvements of up to 20% in makespan and 10% in energy consumption compared to a non-optimized execution.
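A hedged sketch of the monitor-and-migrate loop described above; the threshold, metric names, and migrate call are illustrative placeholders, not the CLARISSE interface.

```python
# Illustrative control loop: detect performance degradation on shared
# nodes and trigger application migration. All names are hypothetical;
# this is not the CLARISSE interface.
DEGRADATION_THRESHOLD = 0.8  # observed/expected progress ratio

def degraded(app_metrics):
    return app_metrics["progress_ratio"] < DEGRADATION_THRESHOLD

def schedule_step(apps, free_nodes, migrate):
    for app in apps:
        if degraded(app["metrics"]) and free_nodes:
            migrate(app["name"], free_nodes.pop())

apps = [{"name": "cfd_sim", "metrics": {"progress_ratio": 0.6}},
        {"name": "fft_job", "metrics": {"progress_ratio": 0.95}}]
schedule_step(apps, free_nodes=["node-12"],
              migrate=lambda name, node: print(f"migrating {name} -> {node}"))
```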


Author(s):  
James D Stevens ◽  
Andreas Klöckner

The ability to model, analyze, and predict execution time of computations is an important building block that supports numerous efforts, such as load balancing, benchmarking, job scheduling, developer-guided performance optimization, and the automation of performance tuning for high-performance parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs, which are increasingly prevalent in the world's fastest supercomputers. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. With this approach, we empower the user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define their own model and customize the calibration process, making it as simple or complex, and as application-targeted or general, as desired. As application examples of our approach, we demonstrate both linear and nonlinear models; these examples are designed to predict execution times for multiple variants of a particular computation: two matrix-matrix multiplication variants, four discontinuous Galerkin differentiation operation variants, and two 2D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this highly user-customizable approach as a response to a central question arising in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use? We believe this last attribute precludes approaches that require manual collection of kernel or hardware statistics.
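The linear variant of such a model can be written as t ≈ Σᵢ wᵢcᵢ, where the cᵢ are automatically gathered operation counts and the weights wᵢ are calibrated against benchmark timings. A minimal least-squares sketch under that assumption, with synthetic counts and timings rather than the authors' tooling:

```python
# Minimal sketch of calibrating a linear execution-time model
# t ~ sum_i w_i * c_i from benchmark kernels via least squares.
# The counts and timings below are synthetic placeholders.
import numpy as np

# Rows: benchmark kernels; columns: operation counts (flops, bytes, barriers).
counts = np.array([[1e9, 4e8, 1e3],
                   [2e9, 1e8, 5e2],
                   [5e8, 8e8, 2e3],
                   [3e9, 2e8, 1e2]])
measured_times = np.array([0.021, 0.024, 0.019, 0.031])  # seconds

weights, *_ = np.linalg.lstsq(counts, measured_times, rcond=None)

def predict_time(kernel_counts):
    """Predict execution time for a kernel from its operation counts."""
    return kernel_counts @ weights

print(predict_time(np.array([1.5e9, 5e8, 1.2e3])))
```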


2018 ◽  
Author(s):  
Huan Truong

A number of problems in the bioinformatics, systems biology, and computational biology fields require abstracting physical entities to mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from the advancement of computing power in the serial processing capabilities of individual CPU cores. However, this growth has slowed in recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as complements to or replacements for traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of available computational power and sequencing data, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge needed to arrive at an answer. I have developed tools to perform analyses at scales that are traditionally unattainable on general-purpose CPU platforms. I have developed a method to accelerate sequence alignment on the GPU, and I used it to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with this computational power. I have also developed a method to accelerate pairwise k-mer comparison on the GPU, and I used it to further develop PolyHomology, a framework that scaffolds shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that this approach to heterogeneous computing can help answer questions in biology and is a viable path to new discoveries, both now and in the future.
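As a toy CPU-side illustration of the pairwise k-mer comparison step that the thesis accelerates on the GPU, two sequences can be scored by the overlap of their k-mer sets:

```python
# Toy CPU version of pairwise k-mer comparison: similarity of two
# sequences as the Jaccard index of their k-mer sets. The GPU method in
# the thesis parallelizes this kind of comparison at much larger scale.
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(a, b, k=4):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

print(kmer_jaccard("ACGTACGTGACG", "ACGTACGTTACG"))
```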


Author(s):  
Prof. Vanita Babanne ◽  
Amol Kajale ◽  
Gaurav Menaria ◽  
Manish Kamble ◽  
Pranav Mundada

Irrigation forms one of the mainstays of agriculture and food production. As a result of outdated strategies in developed and developing countries, much water is wasted in this process. In this article, we establish a regulatory model of irrigation management to check this waste of water by providing a good irrigation system for farming. The prototype Smart Automatic Irrigation Controller (SAIC) has two operating units, viz. a Wireless Sensor Unit and a Wireless Information Processing Unit. The purpose of the sensor unit is to measure climate and soil conditions and to calculate the actual water loss due to evapotranspiration. The processing unit takes this calculation and performs the regulatory actions required to deliver the right amount of water to the farm. A combination of basic rules is included in the decision-making table. The model was initially developed and validated through effectiveness testing. The results obtained showed the potential to compensate for water loss by almost 100%. The controller achieved a 27% reduction in water use and a 40% increase in crop yields. The prototype is connected to a cloud server for data storage and remote control access. The device is efficient, inexpensive, and easy for end users to operate. The model is new and unique in the sense that it can plan irrigation for all crop types, in all climatic conditions and all soil types, while feeding the right combination of soil type and growth stage to the inference engine.
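A hedged sketch of the kind of decision rule such a controller might apply: estimate the crop's water loss from evapotranspiration, then command only the deficit. The coefficients and thresholds below are placeholders, not the SAIC decision table.

```python
# Illustrative irrigation decision rule: irrigate only the water deficit
# estimated from evapotranspiration and the soil-moisture reading.
# Coefficients and thresholds are placeholders, not the SAIC rule table.
def water_deficit_mm(et0_mm, crop_coeff, rainfall_mm):
    """Crop water need (mm) not covered by rainfall."""
    return max(0.0, et0_mm * crop_coeff - rainfall_mm)

def irrigation_command(soil_moisture_pct, et0_mm, crop_coeff, rainfall_mm,
                       moisture_threshold_pct=60.0):
    if soil_moisture_pct >= moisture_threshold_pct:
        return 0.0  # soil already wet enough; no irrigation
    return water_deficit_mm(et0_mm, crop_coeff, rainfall_mm)

print(irrigation_command(soil_moisture_pct=45.0, et0_mm=5.2,
                         crop_coeff=1.1, rainfall_mm=1.0))
```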

