Mitigating execution unit contention in parallel applications using instruction‐aware mapping

Matlab and Parallel Computing

Image Processing & Communications ◽

10.2478/v10248-012-0048-5 ◽

2012 ◽

Vol 17 (4) ◽

pp. 207-216 ◽

Cited By ~ 5

Author(s):

Magdalena Szymczyk ◽

Piotr Szymczyk

Keyword(s):

Image Processing ◽

Signal Processing ◽

Parallel Computing ◽

Distributed Computing ◽

Control Systems ◽

High Performance ◽

Parallel Applications ◽

Process Simulations ◽

Key Features ◽

Financial Process

MATLAB is a technical computing language used in a variety of fields, such as control systems, image and signal processing, visualization, and financial process simulations, in an easy-to-use environment. MATLAB offers "toolboxes", which are specialized libraries for a variety of scientific domains, and a simplified interface to high-performance libraries (LAPACK, BLAS, and FFTW). MATLAB is now enriched by the possibility of parallel computing with the Parallel Computing Toolbox™ and MATLAB Distributed Computing Server™. In this article we present some of the key features of MATLAB parallel applications, focusing on the use of GPUs for image processing.

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
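As a minimal sketch of what a Pareto-optimal trade-off between runtime and energy means, the Python snippet below filters a set of measured configurations down to those that no other configuration beats on both metrics. The function name `pareto_front` and the sample numbers are hypothetical and are not taken from the paper; the authors' statistical and machine learning models predict such fronts from a few runs, which this sketch does not attempt.

```python
from typing import List, Tuple

def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (runtime, energy) points; both values are minimized."""
    front = []
    best_energy = float("inf")
    # After sorting by runtime, a point is non-dominated only if its energy
    # is strictly lower than that of every faster point seen so far.
    for runtime, energy in sorted(points):
        if energy < best_energy:
            front.append((runtime, energy))
            best_energy = energy
    return front

# Hypothetical measurements: (runtime in seconds, energy in kilojoules).
runs = [(100, 50), (110, 30), (120, 35), (130, 28), (105, 60)]
print(pareto_front(runs))
```

Each point that survives the filter represents a distinct trade-off: moving along the front buys lower energy at the cost of longer runtime, which is the kind of option the abstract describes for AMG.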

Methodologies for the WCET Analysis of Parallel Applications on Many-Core Architectures

2015 Euromicro Conference on Digital System Design ◽

10.1109/dsd.2015.105 ◽

2015 ◽

Cited By ~ 5

Author(s):

Vincent Nelis ◽

Patrick Meumeu Yomsi ◽

Luis Miguel Pinho

Keyword(s):

Parallel Applications ◽

WCET Analysis ◽

Many Core

Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver

49th International Conference on Parallel Processing - ICPP ◽

10.1145/3404397.3404440 ◽

2020 ◽

Author(s):

Adrian Munera ◽

Sara Royuela ◽

Germán Llort ◽

Estanislao Mercadal ◽

Franck Wartel ◽

...

Keyword(s):

Embedded Systems ◽

Parallel Applications

Combining Thread Throttling and Mapping to Optimize the EDP of Parallel Applications

2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) ◽

10.1109/pdp52278.2021.00035 ◽

2021 ◽

Author(s):

Gustavo P. Berned ◽

Thiarles S. Medeiros ◽

Matheus Serpa ◽

Fabio D. Rossi ◽

Marcelo C. Luizelli ◽

...

Keyword(s):

Parallel Applications

Scheduling Task-parallel Applications in Dynamically Asymmetric Environments

49th International Conference on Parallel Processing - ICPP : Workshops ◽

10.1145/3409390.3409408 ◽

2020 ◽

Author(s):

Jing Chen ◽

Pirah Noor Soomro ◽

Mustafa Abduljabbar ◽

Madhavan Manivannan ◽

Miquel Pericas

Keyword(s):

Parallel Applications ◽

Task Parallel

Concurrency emulation and analysis of parallel applications for multi-processor system-on-chip co-design

Proceedings of the 6th IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis - CODES/ISSS '08 ◽

10.1145/1450135.1450138 ◽

2008 ◽

Cited By ~ 2

Author(s):

Giovanni Beltrame ◽

Luca Fossati ◽

Donatella Sciuto

Keyword(s):

System On Chip ◽

Parallel Applications ◽

On Chip

Fast Key-Value Lookups with Node Tracker

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3452099 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-26

Author(s):

Mustafa Cavus ◽

Mohammed Shatnawi ◽

Resit Sendag ◽

Augustus K. Uht

Keyword(s):

Data Structure ◽

Linked Data ◽

Single Thread ◽

Highly Effective ◽

Execution Unit

Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing traversals of linked data structures. They also have many branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, only a small fraction of the performance benefit is achieved through prefetching alone unless branch mispredictions are also eliminated. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit and 18.8× as a pre-execution unit with BOS. With 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.
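To illustrate the access pattern the abstract describes, here is a small Python sketch of a chained hash-table lookup. All names (`Node`, `build_table`, `lookup`) are hypothetical, and the code only models the software side: it does not represent the Node Tracker hardware or its interface. Each step of the loop depends on the pointer loaded in the previous step (pointer chasing), and the key comparison is a data-dependent branch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    key: int
    value: str
    next: Optional["Node"] = None

def build_table(items: List[Tuple[int, str]], num_buckets: int) -> List[Optional[Node]]:
    """Build a chained hash table; new nodes are pushed at the head of their bucket."""
    buckets: List[Optional[Node]] = [None] * num_buckets
    for key, value in items:
        idx = key % num_buckets
        buckets[idx] = Node(key, value, buckets[idx])
    return buckets

def lookup(buckets: List[Optional[Node]], key: int) -> Tuple[Optional[str], int]:
    """Return (value or None, number of next-pointer hops taken)."""
    node = buckets[key % len(buckets)]
    hops = 0
    while node is not None:
        if node.key == key:   # data-dependent branch: outcome depends on the key
            return node.value, hops
        node = node.next      # dependent load: the next address is known only now
        hops += 1
    return None, hops

# Keys 1, 5 and 9 all fall into bucket 1 of a 4-bucket table.
table = build_table([(1, "a"), (5, "b"), (9, "c")], num_buckets=4)
print(lookup(table, 1))
```

Because separate key lookups are independent of one another, several such traversals could in principle proceed concurrently, which is the inter key-lookup parallelism the abstract refers to.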