Mitigating execution unit contention in parallel applications using instruction‐aware mapping

Author(s):  
Matheus S. Serpa ◽  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Arthur F. Lorenzon ◽  
Antonio C. S. Beck ◽  
...  
2012 ◽  
Vol 17 (4) ◽  
pp. 207-216 ◽  
Author(s):  
Magdalena Szymczyk ◽  
Piotr Szymczyk

Abstract: MATLAB is a technical computing language used in a variety of fields, such as control systems, image and signal processing, visualization, and financial process simulation, in an easy-to-use environment. MATLAB offers "toolboxes", which are specialized libraries for a variety of scientific domains, and a simplified interface to high-performance libraries (LAPACK, BLAS, and FFTW). MATLAB now also supports parallel computing through the Parallel Computing Toolbox™ and MATLAB Distributed Computing Server™. In this article we present some of the key features of MATLAB parallel applications, focusing on the use of GPUs for image processing.


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload, and their effect on performance and energy efficiency, are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). We demonstrate that we can train the models using as few as 12 runs, with a prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
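As a rough illustration of what a Pareto-optimal trade-off between runtime and energy means, the sketch below filters a set of measured configurations down to those that no other configuration beats on both metrics. It is not the authors' statistical or machine learning model, only a plain dominance filter; the configuration labels and the measurements in `runs` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# (configuration label, runtime in seconds, energy in joules); lower is better for both.
Run = Tuple[str, float, float]


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return the runs that are not dominated in both runtime and energy."""
    front: List[Run] = []
    best_energy = float("inf")
    # Sweep in order of increasing runtime (ties broken by energy).
    # A run is kept only if it uses strictly less energy than every faster run.
    for run in sorted(runs, key=lambda r: (r[1], r[2])):
        if run[2] < best_energy:
            front.append(run)
            best_energy = run[2]
    return front


# Hypothetical measurements, invented purely for illustration.
runs: List[Run] = [
    ("a", 100.0, 5000.0),
    ("b", 110.0, 3500.0),
    ("c", 120.0, 3600.0),
    ("d", 105.0, 5200.0),
    ("e", 150.0, 3000.0),
]

for label, runtime_s, energy_j in pareto_front(runs):
    print(label, runtime_s, energy_j)
```

Each configuration that survives the filter is one trade-off option: moving along the front gives up some runtime in exchange for lower energy use.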


Author(s):  
Adrian Munera ◽  
Sara Royuela ◽  
Germán Llort ◽  
Estanislao Mercadal ◽  
Franck Wartel ◽  
...  

Author(s):  
Gustavo P. Berned ◽  
Thiarles S. Medeiros ◽  
Matheus Serpa ◽  
Fabio D. Rossi ◽  
Marcelo C. Luizelli ◽  
...  

Author(s):  
Jing Chen ◽  
Pirah Noor Soomro ◽  
Mustafa Abduljabbar ◽  
Madhavan Manivannan ◽  
Miquel Pericas

2021 ◽  
Vol 18 (3) ◽  
pp. 1-26
Author(s):  
Mustafa Cavus ◽  
Mohammed Shatnawi ◽  
Resit Sendag ◽  
Augustus K. Uht

Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing traversals of linked data structures. They also have many branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, prefetching alone achieves only a small fraction of the potential performance benefit unless branch mispredictions are also eliminated. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. Finally, we evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit; and 18.8× as a pre-execution unit with BOS. With 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.


2001 ◽  
Vol 17 (6) ◽  
pp. 769-782 ◽  
Author(s):  
Aske Plaat ◽  
Henri E. Bal ◽  
Rutger F.H. Hofman ◽  
Thilo Kielmann
