Uma Metodologia para Reduzir o Custo de Aprendizado para Técnicas de Otimização de Aplicações Paralelas

Level Parallelism ◽

Energy Delay Product

A exploração do paralelismo em nível de threads (TLP - Thread Level Parallelism) tem sido amplamente utilizada para melhorar o desempenho de aplicações de diferentes domínios. Entretanto, muitas aplicações não escalam conforme o número de threads aumenta, ou seja, executar uma aplicação utilizando o máximo de threads não trará, necessariamente, o melhor resultado para tempo, energia ou EDP(Energy Delay Product), devido a questões relacionadas à hardware e Software [Raasch and Reinhardt 2003],[Lorenzon and Filho 2019]. Portanto, é preciso utilizar metodologias que consigam buscar um número ideal de threads para tais aplicações, sejam estas, online (busca enquanto a aplicação é executada) ou offline (busca antes da execução da aplicação). Entretanto, metodologias online acabam adicionando uma sobrecarga na execução da aplicação, o que não acontece nas abordagens offline [Lorenzon et al. 2018]. Com base nisto, este trabalho apresenta uma metodologia genérica para reduzir significativamente o tempo de busca pelo número de threads ideal para aplicações paralelas que utilizam a metodologia offline, inferindo o ambiente de execução das aplicações paralelas utilizando apenas pequenos conjuntos de entrada de dados.

Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 ◽

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

10.1145/1555754.1555775 ◽

2009 ◽

Cited By ~ 256

Author(s):

Sunpyo Hong ◽

Hyesoon Kim

Keyword(s):

Analytical Model ◽

Level Parallelism ◽

Gpu Architecture ◽

With Memory

Thread partitioning and value prediction for exploiting speculative thread-level parallelism

IEEE Transactions on Computers ◽

10.1109/tc.2004.1261823 ◽

2004 ◽

Vol 53 (2) ◽

pp. 114-125 ◽

Cited By ~ 11

Author(s):

P. Marcuello ◽

A. Gonzalez ◽

J. Tubella

Keyword(s):

Value Prediction ◽

Thread Partitioning ◽

GPU Performance vs. Thread-Level Parallelism

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3177964 ◽

2018 ◽

Vol 15 (1) ◽

pp. 1-21 ◽

Cited By ~ 4

Author(s):

Zhen Lin ◽

Michael Mantor ◽

Huiyang Zhou

Keyword(s):

CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

The Scientific World JOURNAL ◽

10.1155/2015/848416 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10

Author(s):

Jianliang Ma ◽

Jinglei Meng ◽

Tianzhou Chen ◽

Minghui Wu

Keyword(s):

Scheduling Algorithm ◽

Global Memory ◽

Request Sequence ◽

Level Parallelism ◽

Memory Request ◽

Request Service

Ultra high thread-level parallelism in modern GPUs usually introduces numerous memory requests simultaneously. So there are always plenty of memory requests waiting at each bank of the shared LLC (L2 in this paper) and global memory. For global memory, various schedulers have already been developed to adjust the request sequence. But we find few work has ever focused on the service sequence on the shared LLC. We measured that a big number of GPU applications always queue at LLC bank for services, which provide opportunity to optimize the service order on LLC. Through adjusting the GPU memory request service order, we can improve the schedulability of SM. So we proposed a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. The priority representative of memory request is critical for CaLRS. We use the number of memory requests that originate from the same warp but have not been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can boost the SM schedulability effectively by promoting the scheduling priority of the memory requests with high criticality and improves the performance of GPU indirectly.

Exploiting Thread-Level Parallelism of Irregular LDPC Decoder with Simultaneous Multi-threading Technique

Lecture Notes in Computer Science - Advanced Parallel Processing Technologies ◽

10.1007/978-3-540-76837-1_70 ◽

2007 ◽

pp. 650-657 ◽

Cited By ~ 1

Author(s):

Xing Fang ◽

Dong Wang ◽

Shuming Chen

Keyword(s):

Ldpc Decoder ◽

Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs

Oil & Gas Science and Technology – Revue d’IFP Energies nouvelles ◽

10.2516/ogst/2018047 ◽

2018 ◽

Vol 73 ◽

pp. 47 ◽

Cited By ~ 3

Author(s):

Ramon Amela ◽

Cristian Ramon-Cortes ◽

Jorge Ejarque ◽

Javier Conejero ◽

Rosa M. Badia

Keyword(s):

Programming Languages ◽

Linear Algebra ◽

Programming Model ◽

Xeon Phi ◽

Scientific Communities ◽

Heterogeneous Architectures ◽

Parallel Programming Model ◽

Significant Performance ◽

Python is a popular programming language due to the simplicity of its syntax, while still achieving a good performance even being an interpreted language. The adoption from multiple scientific communities has evolved in the emergence of a large number of libraries and modules, which has helped to put Python on the top of the list of the programming languages [1]. Task-based programming has been proposed in the recent years as an alternative parallel programming model. PyCOMPSs follows such approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. Also, we present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.