Thread Level Parallelism
Recently Published Documents


TOTAL DOCUMENTS: 92 (FIVE YEARS: 11)

H-INDEX: 14 (FIVE YEARS: 1)

2021 ◽  
Vol 23 (06) ◽  
pp. 840-849
Author(s):  
Nagendra Kumar Jamadagni ◽  
Aniruddh M ◽  
Dr. Govinda Raju M ◽  
Dr. Usha Rani K. R ◽  
...  

All modern-day computers and smartphones come with multi-core CPUs, and the multicore architecture is often heterogeneous in order to maximize computational throughput. These multicore systems exploit thread-level parallelism to deliver higher performance, but they depend on good scheduling algorithms that maximize CPU utilization and minimize wasted and idle cycles. With the rise of streaming services and the multimedia capabilities of smartphones, efficient heterogeneous cores capable of fast multimedia processing are needed, together with efficient scheduling algorithms to drive them. This paper compares several available heterogeneous multi-core scheduling algorithms and determines which is optimal for various codecs.
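As a reminder of what exploiting thread-level parallelism looks like in software, here is a minimal C++ sketch (the workload and the chunked summation are illustrative only and are not taken from the paper): a loop is divided across the hardware threads the runtime reports.

```cpp
// Minimal illustration of thread-level parallelism: a loop is split
// across the hardware threads reported by the runtime. This is a
// generic sketch, not the scheduling algorithms compared in the paper.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<float> data(n, 1.0f);

    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> pool;

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Each thread sums a contiguous chunk of the input.
            std::size_t begin = n * w / workers;
            std::size_t end   = n * (w + 1) / workers;
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& t : pool) t.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << " using " << workers << " threads\n";
}
```

On a heterogeneous (big.LITTLE-style) CPU, which core each of these threads runs on is decided by the operating system's scheduler; that placement decision is precisely what the scheduling algorithms compared in the paper try to optimize.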


2021 ◽  
Author(s):  
Anita Tino

As the multi-core computing era continues to progress, the need to increase single-thread performance and throughput, and to seamlessly adapt to thread-level parallelism (TLP), remains an important issue. Though the number of cores per processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, though often at the expense of single-thread performance as threads per core scale. Applications that require higher single-thread performance must therefore often resort to single-threaded-core multi-processor systems, which incur additional area overhead and power dissipation. In an attempt to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low-overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP increases single-thread Instructions Per Cycle (IPC) by up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency compared to a conventional SMT processor.
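The throughput-versus-single-thread tradeoff described above can be observed with a simple experiment. The sketch below (generic busy-work, arbitrary thread counts, not the ConSSTEP design or its benchmarks) runs the same per-thread workload at increasing thread counts and reports aggregate and per-thread rates; once threads begin sharing physical cores, e.g. as SMT siblings, the per-thread rate drops even though aggregate throughput may still rise.

```cpp
// Sketch of the SMT tradeoff: fixed per-thread work is run at several
// thread counts, and aggregate vs. per-thread rates are reported.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

static std::uint64_t busy_work(std::uint64_t iters) {
    std::uint64_t x = 88172645463325252ull;      // xorshift64 state
    for (std::uint64_t i = 0; i < iters; ++i) {  // dependent chain of cheap ops
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    }
    return x;
}

int main() {
    const std::uint64_t iters = 200000000ull;    // fixed per-thread work
    const unsigned counts[] = {1, 2, 4, 8};      // arbitrary thread counts
    for (unsigned threads : counts) {
        std::atomic<std::uint64_t> sink{0};      // keeps the work from being optimized away
        std::vector<std::thread> pool;
        auto t0 = std::chrono::steady_clock::now();
        for (unsigned i = 0; i < threads; ++i)
            pool.emplace_back([&] {
                sink.fetch_add(busy_work(iters), std::memory_order_relaxed);
            });
        for (auto& t : pool) t.join();
        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        std::cout << threads << " thread(s): "
                  << threads * iters / secs / 1e9 << " G iter/s aggregate, "
                  << iters / secs / 1e9 << " G iter/s per thread"
                  << " (sink " << sink.load() << ")\n";
    }
}
```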



Author(s):  
Jeckson Dellagostin Souza ◽  
Madhavan Manivannan ◽  
Miquel Pericas ◽  
Antonio Carlos Schneider Beck

2020 ◽  
Author(s):  
Gustavo Berned ◽  
Arthur Lorenzon

The exploitation of thread-level parallelism (TLP) has been widely used to improve the performance of applications from different domains. However, many applications do not scale as the number of threads increases; that is, running an application with the maximum number of threads does not necessarily yield the best result in execution time, energy, or EDP (Energy-Delay Product), due to hardware- and software-related issues [Raasch and Reinhardt 2003], [Lorenzon and Filho 2019]. Therefore, methodologies are needed that can find an ideal number of threads for such applications, whether online (searching while the application runs) or offline (searching before the application runs). Online methodologies, however, add overhead to the application's execution, which does not happen with offline approaches [Lorenzon et al. 2018]. Based on this, this work presents a generic methodology to significantly reduce the time spent searching for the ideal number of threads for parallel applications under the offline methodology, by inferring the execution behavior of the parallel applications using only small input data sets.
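A minimal sketch of the offline search referred to above, assuming a placeholder parallel kernel (run_with_threads) and an arbitrary set of candidate thread counts; the authors' methodology for shrinking this search is not reproduced here, and energy/EDP would additionally require hardware energy counters (e.g. RAPL), which are omitted.

```cpp
// Offline search for a good thread count: run the parallel kernel once
// per candidate configuration on a reduced input and keep the fastest.
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Placeholder parallel workload: sum a vector using `threads` threads.
double run_with_threads(unsigned threads, const std::vector<double>& data) {
    std::vector<std::thread> pool;
    std::vector<double> partial(threads, 0.0);
    for (unsigned w = 0; w < threads; ++w)
        pool.emplace_back([&, w] {
            std::size_t lo = data.size() * w / threads;
            std::size_t hi = data.size() * (w + 1) / threads;
            for (std::size_t i = lo; i < hi; ++i) partial[w] += data[i];
        });
    for (auto& t : pool) t.join();
    double s = 0.0;
    for (double p : partial) s += p;
    return s;
}

int main() {
    // Small input stands in for the reduced data sets used to infer
    // the behaviour of the full-size run.
    std::vector<double> small_input(1 << 20, 1.0);
    const unsigned candidates[] = {1, 2, 4, 8, 16};

    unsigned best_threads = 1;
    double best_time = 1e300;
    for (unsigned t : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = run_with_threads(t, small_input);
        (void)sink;
        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        std::cout << t << " threads: " << secs << " s\n";
        if (secs < best_time) { best_time = secs; best_threads = t; }
    }
    std::cout << "selected thread count: " << best_threads << "\n";
}
```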


2020 ◽  
Vol 21 (1) ◽  
pp. 47-56
Author(s):  
K Indragandhi ◽  
Jawahar P K

Recent embedded devices are equipped with multicore processors, which significantly improve system performance. To utilize all the cores of a multicore processor efficiently, application programs need to be parallelized. This paper proposes an efficient thread-level parallelism (ETLP) scheme and uses a computationally intensive edge detection algorithm for its evaluation. Edge detection is an important step in various real-time applications, such as vehicle detection in traffic control and medical image processing. The main objective of the ETLP scheme is to reduce execution time and increase CPU core utilization. The performance of the ETLP scheme is evaluated against a basic edge detection scheme (BEDS) for different image sizes. The experimental results reveal that the proposed ETLP scheme achieves an efficiency of 49% and 72% for image sizes of 300 x 256 and 1024 x 1024, respectively. Furthermore, the ETLP scheme reduces execution time by 66% for the 1024 x 1024 image when compared with BEDS.
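To illustrate the kind of thread-level decomposition an edge-detection kernel admits, the sketch below partitions the image rows into bands and filters each band with its own thread using a plain Sobel operator. It is a generic decomposition, not the ETLP scheme evaluated in the paper.

```cpp
// Row-partitioned Sobel edge detection: rows are split into bands and
// each band is filtered by its own thread. Bands are disjoint in the
// output and the input is read-only, so no synchronisation is needed.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

void sobel_rows(const std::vector<std::uint8_t>& in, std::vector<std::uint8_t>& out,
                int width, int height, int row_begin, int row_end) {
    auto px = [&](int x, int y) { return static_cast<int>(in[y * width + x]); };
    for (int y = std::max(row_begin, 1); y < std::min(row_end, height - 1); ++y)
        for (int x = 1; x < width - 1; ++x) {
            int gx = -px(x-1,y-1) + px(x+1,y-1) - 2*px(x-1,y) + 2*px(x+1,y)
                     - px(x-1,y+1) + px(x+1,y+1);
            int gy = -px(x-1,y-1) - 2*px(x,y-1) - px(x+1,y-1)
                     + px(x-1,y+1) + 2*px(x,y+1) + px(x+1,y+1);
            int mag = static_cast<int>(std::sqrt(double(gx*gx + gy*gy)));
            out[y * width + x] = static_cast<std::uint8_t>(std::min(mag, 255));
        }
}

int main() {
    const int width = 1024, height = 1024;   // matches the larger test size
    std::vector<std::uint8_t> in(width * height, 128), out(width * height, 0);

    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        int r0 = height * w / workers;
        int r1 = height * (w + 1) / workers;
        pool.emplace_back([&, r0, r1] { sobel_rows(in, out, width, height, r0, r1); });
    }
    for (auto& t : pool) t.join();
}
```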


In computational biology, motifs are short, recurring patterns in biological sequences that play a central role in the analysis and interpretation of various biological questions such as human disease, gene function, and drug design. The major objectives of the motif search problem are the management, analysis, and interpretation of huge biological sequences using computational techniques from computer science and mathematics. However, motif detection leads to computational problems whose solutions require a substantial amount of time on a uniprocessor machine, and it thus remains a challenging problem. In this chapter, two parallel algorithms are proposed, along with their implementation details, which substantially enhance the performance of the PMSP motif search algorithm. The first approach enhances the existing algorithm by eliminating redundant computation and minimizes execution time by using both process-level and thread-level parallelism in the implementation. The second approach improves on the first: not only is computation time reduced further, but better space utilization is also achieved.
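The combination of process-level and thread-level parallelism mentioned above can be sketched as follows; a naive exact-match count stands in for the PMSP motif search, and the two-level partitioning (one fork per process, std::thread within each process) is illustrative rather than the chapter's implementation.

```cpp
// Two-level parallelism: the sequence set is split across forked worker
// processes, and each process further splits its share across threads.
// A naive exact-match count stands in for the PMSP motif search itself.
#include <atomic>
#include <iostream>
#include <string>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>
#include <vector>

static long count_motif(const std::vector<std::string>& seqs,
                        const std::string& motif,
                        std::size_t begin, std::size_t end) {
    long hits = 0;
    for (std::size_t i = begin; i < end; ++i)
        for (std::size_t p = seqs[i].find(motif); p != std::string::npos;
             p = seqs[i].find(motif, p + 1))
            ++hits;
    return hits;
}

int main() {
    // Toy data set; real inputs would be read from sequence files.
    std::vector<std::string> seqs(64, "ACGTACGTTACGATTACAACGT");
    const std::string motif = "ACGT";
    const int processes = 4, threads_per_proc = 2;

    for (int p = 0; p < processes; ++p) {
        if (fork() == 0) {                       // child: process-level parallelism
            std::size_t lo = seqs.size() * p / processes;
            std::size_t hi = seqs.size() * (p + 1) / processes;

            std::atomic<long> local{0};          // thread-level parallelism inside
            std::vector<std::thread> pool;
            for (int t = 0; t < threads_per_proc; ++t)
                pool.emplace_back([&, t] {
                    std::size_t tlo = lo + (hi - lo) * t / threads_per_proc;
                    std::size_t thi = lo + (hi - lo) * (t + 1) / threads_per_proc;
                    local += count_motif(seqs, motif, tlo, thi);
                });
            for (auto& th : pool) th.join();
            std::cout << "process " << p << " found " << local.load() << " matches\n";
            _exit(0);                            // children report and exit
        }
    }
    while (wait(nullptr) > 0) {}                 // parent reaps all children
    return 0;
}
```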


2019 ◽  
Vol 27 (2) ◽  
pp. 55-66

Parallelization of Non-Serial Polyadic Dynamic Programming (NPDP) on high-throughput manycore architectures, such as NVIDIA GPUs, suffers from load imbalance, i.e. a non-optimal mapping between the sub-problems of NPDP and the processing elements of the GPU. NPDP exhibits non-uniformity in the number of subproblems as well as in computational complexity across phases. In NPDP parallelization, phases are computed sequentially whereas the subproblems of each phase are computed concurrently. It is therefore essential to map the subproblems of each phase effectively onto the processing elements when implementing thread-level parallelism. We propose an adaptive Generalized Mapping Method (GMM) for NPDP parallelization that utilizes the GPU for efficient mapping of subproblems onto processing threads in each phase. The input size and the targeted GPU determine the available computing power and the best mapping for each phase in NPDP parallelization. The performance of GMM is compared with different conventional parallelization approaches. For sufficiently large inputs, our technique outperforms the state-of-the-art conventional parallelization approach and achieves a significant speedup of a factor of 30. We also summarize general heuristics for achieving better gains in NPDP parallelization.
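The phase structure that causes this imbalance is easy to see in a CPU sketch of a classic NPDP kernel, matrix-chain ordering, shown below: phases run sequentially, the subproblems inside a phase are independent and are split across std::thread workers, and the subproblem count shrinks from n-1 down to 1 as phases progress. The sketch is illustrative; it is not the GPU-side GMM kernel.

```cpp
// Phase-wise parallelization of a classic NPDP kernel (matrix-chain
// ordering) with std::thread. Phases are sequential; the subproblems
// of each phase are independent and are split across threads. The
// per-phase subproblem count shrinks from n-1 to 1, which is the
// non-uniformity a GPU mapping scheme has to cope with.
#include <algorithm>
#include <iostream>
#include <limits>
#include <thread>
#include <vector>

int main() {
    const int n = 512;                                  // number of matrices
    std::vector<long long> p(n + 1, 16);                // matrix dimensions
    std::vector<std::vector<long long>> dp(
        n + 1, std::vector<long long>(n + 1, 0));

    unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    for (int len = 2; len <= n; ++len) {                // phases: sequential
        int subproblems = n - len + 1;                  // shrinks each phase
        unsigned used = std::min<unsigned>(workers, subproblems);
        std::vector<std::thread> pool;
        for (unsigned w = 0; w < used; ++w)
            pool.emplace_back([&, w, len, subproblems, used] {
                // Each thread handles a contiguous slice of this phase.
                int lo = 1 + static_cast<int>(subproblems * w / used);
                int hi = 1 + static_cast<int>(subproblems * (w + 1) / used);
                for (int i = lo; i < hi; ++i) {
                    int j = i + len - 1;
                    long long best = std::numeric_limits<long long>::max();
                    for (int k = i; k < j; ++k)
                        best = std::min(best, dp[i][k] + dp[k + 1][j]
                                              + p[i - 1] * p[k] * p[j]);
                    dp[i][j] = best;                    // only reads shorter chains
                }
            });
        for (auto& t : pool) t.join();
    }
    std::cout << "minimal multiplication cost: " << dp[1][n] << "\n";
}
```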

