parallel applications
Recently Published Documents



2022 ◽  
Vol 19 (1) ◽  
pp. 1-25
Author(s):  
Muhammad Aditya Sasongko ◽  
Milind Chabbi ◽  
Mandana Bagheri Marzijarani ◽  
Didem Unat

One widely used metric of data locality is reuse distance: the number of unique memory locations accessed between two consecutive accesses to a particular memory location. State-of-the-art techniques that measure reuse distance in parallel applications rely on simulators or binary instrumentation tools that incur large performance and memory overheads. Moreover, existing sampling-based tools are limited to measuring the reuse distances of a single thread and discard interactions among threads in multi-threaded programs. In this work, we propose ReuseTracker, a fast and accurate reuse distance analyzer that leverages existing hardware features in commodity CPUs. ReuseTracker is designed for multi-threaded programs and takes cache-coherence effects into account. By utilizing hardware features such as performance monitoring units and debug registers, ReuseTracker can accurately profile reuse distance in parallel applications with much lower overheads than existing tools: it introduces only 2.9× runtime and 2.8× memory overheads. Our tool achieves 92% accuracy when verified against a newly developed configurable benchmark that can generate a variety of reuse distance patterns. We demonstrate the tool's functionality with two use-case scenarios using the PARSEC, Rodinia, and Synchrobench benchmark suites. In the first, ReuseTracker guides code refactoring in these benchmarks by detecting spatial reuses in shared caches that are also false sharing; in the second, it successfully predicts whether some benchmarks in these suites can benefit from adjacent cache line prefetch optimization.
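To make the definition of reuse distance concrete, the following is a minimal sketch that computes it exactly over a short single-threaded address trace. It is not ReuseTracker's method (which samples with hardware counters and debug registers); the function name `reuse_distances` and the toy trace are illustrative assumptions, and the naive scan is quadratic in the trace length.

```python
from typing import Hashable, Iterable, List, Optional


def reuse_distances(trace: Iterable[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in `trace`.

    The reuse distance of an access is the number of unique locations
    touched between it and the previous access to the same location.
    A first-time access has no previous use, so it is reported as None.
    """
    accesses = list(trace)
    last_seen = {}  # location -> index of its most recent access
    distances: List[Optional[int]] = []
    for i, loc in enumerate(accesses):
        if loc in last_seen:
            # Count the distinct locations strictly between the two uses.
            between = accesses[last_seen[loc] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[loc] = i
    return distances


if __name__ == "__main__":
    # Hypothetical trace of five accesses to three locations.
    print(reuse_distances(["a", "b", "c", "b", "a"]))
```

In the toy trace, the second access to `b` has one unique location (`c`) in between, and the second access to `a` has two (`b` and `c`).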


2022 ◽  
Author(s):  
Sandro Rao ◽  
Elisa Demetra Mallemace ◽  
Giuseppe Cocorullo ◽  
Giuliana Faggio ◽  
Giacomo Messina ◽  
...  

The refractive index and its variation with temperature, i.e. the thermo-optic coefficient, are basic optical parameters for all semiconductors used in the fabrication of linear and non-linear opto-electronic devices and systems. Recently, 4H single-crystal Silicon Carbide (4H-SiC) and Gallium Nitride (GaN) have emerged as excellent building materials for high-power and high-temperature electronics, and wide parallel applications in photonics can consequently be forecast in the near future, in particular in the infrared telecommunication band of λ=1500-1600 nm. In this paper, the thermo-optic coefficient (dn/dT) is experimentally measured in 4H-SiC and GaN substrates, from room temperature to 480 K, at the wavelength of 1550 nm. Specifically, the substrates, forming natural Fabry-Perot etalons, are exploited within a simple hybrid fiber–free space optical interferometric system to take accurate measurements of the transmitted optical power in the said temperature range. It is found that, for both semiconductors, dn/dT is itself remarkably temperature dependent, in particular quadratically for GaN and almost linearly for 4H-SiC.
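For orientation, the textbook relations that link the transmitted power of an etalon to dn/dT can be sketched as follows. This is a generic idealization (normal incidence, lossless plate), not the paper's stated procedure; the symbols L (substrate thickness), F (coefficient of finesse), and α (linear thermal expansion coefficient) are assumptions introduced here.

```latex
% Round-trip phase in an etalon of thickness L and refractive index n
\delta(T) = \frac{4\pi\, n(T)\, L(T)}{\lambda}

% Ideal Fabry-Perot transmission as a function of temperature T
\mathcal{T}_{\mathrm{FP}}(T) = \frac{1}{1 + F \sin^{2}\!\left(\delta(T)/2\right)}

% Change of optical thickness with temperature, with \alpha = \frac{1}{L}\frac{dL}{dT}
\frac{d(nL)}{dT} = L\left(\frac{dn}{dT} + n\,\alpha\right)

% Adjacent transmission maxima correspond to \Delta\delta = 2\pi,
% i.e. a change of optical thickness \Delta(nL) = \lambda/2
```

Under these assumptions, counting transmission fringes as the temperature is swept gives the change in nL, from which dn/dT follows once the thermal expansion term nα is accounted for.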


Author(s):  
Matheus S. Serpa ◽  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Arthur F. Lorenzon ◽  
Antonio C. S. Beck ◽  
...  

2021 ◽  
Author(s):  
Claudio Scheer ◽  
Renato B. Hoffmann ◽  
Dalvan Griebler ◽  
Isabel H. Manssour ◽  
Luiz G. Fernandes

Profiling tools are essential to understand the behavior of parallel applications and assist in the optimization process. However, tools such as Perf generate a large amount of data, which requires significant storage space and is difficult to reason about. Therefore, we propose VisPerf: a tool-chain and an interactive visualization dashboard for Perf data. The VisPerf tool-chain profiles the application and pre-processes the data, reducing the storage space required by about 50 times. Moreover, we used the visualization dashboard to quickly understand the performance of different events and visualize specific threads and functions of a real-world application.


2021 ◽  
Author(s):  
Petros Voudouris ◽  
Per Stenström ◽  
Risat Pathan

Heterogeneous multiprocessors can offer high performance at low energy expenditures. However, to be able to use them in hard real-time systems, timing guarantees need to be provided, and the main challenge is to determine the worst-case schedule length (also known as makespan) of an application. Previous works that estimate the makespan focus mainly on the independent-task application model or the related multiprocessor model, which limits the applicability of the makespan. On the other hand, the directed acyclic graph (DAG) application model and the unrelated multiprocessor model are general and can cover most of today's platforms and applications. In this work, we propose a simple work-conserving scheduling method for the tasks in a DAG and two new approaches to finding the makespan. A set of representative OpenMP task-based parallel applications from the BOTS benchmark suite and synthetic DAGs are used to evaluate the proposed method. Based on the empirical results, the proposed approach calculates the makespan close to the exhaustive method and with low pessimism compared to a lower bound of the actual makespan calculation.
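As an illustration of the terms used above, the sketch below simulates one work-conserving schedule of a DAG on unrelated processors, where each task has a separate execution time on each processor. It is not the authors' scheduling method or either of their makespan analyses: it returns the length of a single greedy schedule, not a worst-case bound. The function name `simulate_schedule`, the tie-breaking rule (a ready task takes the idle processor on which it runs fastest), and the example task set are all assumptions.

```python
import heapq
from typing import Dict, List


def simulate_schedule(
    exec_time: Dict[str, List[float]],
    preds: Dict[str, List[str]],
    num_procs: int,
) -> float:
    """Return the length of one work-conserving schedule of a DAG.

    exec_time[t][p] is the execution time of task t on processor p
    (unrelated model); preds[t] lists the tasks that must finish before t.
    The input is assumed to be acyclic.
    """
    tasks = list(exec_time)
    remaining = {t: len(preds.get(t, [])) for t in tasks}
    succs: Dict[str, List[str]] = {t: [] for t in tasks}
    for t in tasks:
        for p in preds.get(t, []):
            succs[p].append(t)

    ready = [t for t in tasks if remaining[t] == 0]
    idle = list(range(num_procs))
    running: list = []  # min-heap of (finish_time, task, processor)
    now = 0.0
    finished = 0

    while finished < len(tasks):
        # Work-conserving: never leave a processor idle while a task is ready.
        while ready and idle:
            task = ready.pop(0)
            proc = min(idle, key=lambda p: exec_time[task][p])
            idle.remove(proc)
            heapq.heappush(running, (now + exec_time[task][proc], task, proc))
        # Advance time to the next task completion.
        now, task, proc = heapq.heappop(running)
        idle.append(proc)
        finished += 1
        for s in succs[task]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return now


if __name__ == "__main__":
    # Hypothetical diamond-shaped DAG: A -> {B, C} -> D, on two processors.
    times = {"A": [2, 4], "B": [3, 1], "C": [2, 2], "D": [1, 3]}
    deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
    print(simulate_schedule(times, deps, num_procs=2))
```

A worst-case analysis would have to consider every schedule a work-conserving policy could produce, which is why the abstract contrasts its approaches with an exhaustive method.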


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-30
Author(s):  
Tyler Sorensen ◽  
Lucas F. Salvador ◽  
Harmit Raval ◽  
Hugues Evrard ◽  
John Wickerson ◽  
...  

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test (i.e. whether it is guaranteed to eventually terminate) under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.


Author(s):  
Vinicius S. da Silva ◽  
Angelo G. D. Nogueira ◽  
Everton Camargo Lima ◽  
Hiago M. G. A. Rocha ◽  
Matheus S. Serpa ◽  
...  

Author(s):  
Mamadou Diarra ◽  
Telesphore Tiendrebeogo

The advent of Big Data has brought new processing and storage challenges. These challenges are often solved by distributed processing. Distributed systems are inherently dynamic and unstable, so it is realistic to expect that some resources will fail during use. Load balancing and task scheduling are important steps in determining the performance of parallel applications, hence the need to design load balancing algorithms adapted to grid computing. In this paper, we propose a dynamic and hierarchical load balancing strategy at two levels: intrascheduler load balancing, in order to avoid the use of the large-scale communication network, and interscheduler load balancing, for load regulation of the whole system. The strategy improves the average response time of CLOAK-Reduce application tasks with minimal communication. We first focus on three performance indicators, namely response time, process latency and running time of MapReduce tasks.

