Is there exploitable thread-level parallelism in general-purpose application programs?

Author(s): Pen-Chung Yew

2011, Vol 2011, pp. 1-13
Author(s): Mateus B. Rutzig, Antonio C. S. Beck, Felipe Madruga, Marco A. Alves, Henrique C. Freitas, ...

The limits of instruction-level parallelism and ever higher transistor densities sustain the increasing need for multiprocessor systems, which are rapidly taking over both the general-purpose and the embedded processor domains. Current multiprocessing systems are composed either of many homogeneous, simple cores or of complex superscalar, simultaneous-multithreading processing elements. As parallel applications become increasingly common in both embedded and general-purpose domains, and multiprocessing systems must handle a wide range of application classes, there is no consensus on which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we extend the DIM (dynamic instruction merging) technique to a multiprocessing scenario, demonstrating the need for adaptable ILP exploitation even in TLP-oriented architectures. We successfully couple a dynamic reconfigurable system to a SPARC-based multiprocessor and obtain performance gains of up to 40%, even for applications that already exhibit a high degree of thread-level parallelism.
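To make concrete why per-core ILP acceleration still pays off for applications that already scale across threads, here is a back-of-the-envelope estimate (illustrative numbers only, not taken from the paper) combining an Amdahl-style thread-level speedup with a per-core acceleration factor of the kind a reconfigurable array could provide:

```python
def combined_speedup(parallel_fraction, cores, ilp_boost):
    """Amdahl-style estimate: thread-level speedup from `cores`, multiplied
    by a per-core ILP acceleration factor (illustrative model only)."""
    tlp = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
    return tlp * ilp_boost

# Even for a highly thread-parallel application (95% parallel on 4 cores),
# a modest per-core boost still compounds with the thread-level speedup:
print(round(combined_speedup(0.95, 4, 1.0), 2))  # TLP only      -> ~3.48
print(round(combined_speedup(0.95, 4, 1.4), 2))  # with +40% ILP -> ~4.87
```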


2018, Vol 15 (1), pp. 1-21
Author(s): Zhen Lin, Michael Mantor, Huiyang Zhou

2015, Vol 2015, pp. 1-10
Author(s): Jianliang Ma, Jinglei Meng, Tianzhou Chen, Minghui Wu

The extremely high thread-level parallelism of modern GPUs issues numerous memory requests simultaneously, so there are always many requests waiting at each bank of the shared last-level cache (LLC, the L2 in this paper) and of global memory. For global memory, various schedulers have already been developed to reorder requests, but little prior work has focused on the service order at the shared LLC. We observe that in many GPU applications requests queue up at the LLC banks, which creates an opportunity to optimize the LLC service order: by adjusting the order in which memory requests are served, the schedulability of the SMs can be improved. We therefore propose a criticality-aware shared LLC request scheduling algorithm (CaLRS). The priority metric is central to CaLRS: the criticality of a warp is the number of its memory requests that have not yet been serviced when a request arrives at the shared LLC bank. Experiments show that the proposed scheme effectively boosts SM schedulability by promoting the scheduling priority of high-criticality memory requests, and thereby indirectly improves GPU performance.
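The abstract describes the criticality metric and the resulting priority order rather than an implementation; the following minimal Python sketch illustrates that ordering for a single LLC bank queue, with the `pending_per_warp` bookkeeping and the tie-breaking rule as illustrative assumptions:

```python
import heapq

def criticality(pending_per_warp, warp_id):
    """CaLRS-style criticality: outstanding (unserviced) requests of the same warp."""
    return pending_per_warp.get(warp_id, 0)

def enqueue(bank_queue, arrival, warp_id, pending_per_warp):
    # Min-heap: negate criticality so more-critical requests pop first,
    # falling back to arrival order (FCFS) among equally critical requests.
    crit = criticality(pending_per_warp, warp_id)
    heapq.heappush(bank_queue, (-crit, arrival, warp_id))

def serve_next(bank_queue, pending_per_warp):
    _, arrival, warp_id = heapq.heappop(bank_queue)
    pending_per_warp[warp_id] -= 1  # one fewer outstanding request for this warp
    return warp_id, arrival

# Example: warp 2 has 4 outstanding requests, warp 7 has only 1.
pending = {7: 1, 2: 4}
bank = []
enqueue(bank, arrival=0, warp_id=7, pending_per_warp=pending)
enqueue(bank, arrival=1, warp_id=2, pending_per_warp=pending)
print(serve_next(bank, pending)[0])  # -> 2: higher criticality is served first
```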


Author(s): Chao-Chin Wu, Jenn-Yang Ke, Heshan Lin, Syun-Sheng Jhan

Dynamic Programming (DP) is an important and popular method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. DP breaks a problem into simpler subproblems and combines their solutions into a solution to the original one. This paper focuses on one type of dynamic programming, Nonserial Polyadic Dynamic Programming (NPDP). To run NPDP applications efficiently on an emerging General-Purpose Graphics Processing Unit (GPGPU), the authors have to exploit more parallelism to fully utilize the hundreds of processing units it contains. However, the degree of parallelism varies significantly across the phases of an NPDP application. To address this, the authors propose a method that adjusts the thread-level parallelism to provide a sufficient and steadier degree of parallelism in each phase. If a phase has insufficient parallelism, threads are split into sub-threads; conversely, the total number of threads in a phase can be limited by merging threads. The authors also examine the difference between the conventional problem of finding a minimum on a GPU and the NPDP-specific problem of finding the minimums of many independent sets on a GPU, and they study how to design an appropriate data structure for the memory-coalescing optimization. The experimental results demonstrate that the method obtains a best speedup of 13.40 over a previously published algorithm.
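The following minimal Python sketch (not the authors' GPU code) illustrates the split/merge idea: for each phase, the number of independent subproblems decides whether a subproblem is split across several sub-threads or several subproblems are merged onto one thread; the `target_threads` and `max_split` values are illustrative assumptions:

```python
def plan_phase(num_subproblems, target_threads, max_split=8):
    """Decide the thread configuration for one NPDP phase.

    Returns (threads_per_subproblem, subproblems_per_thread) so the total
    thread count stays close to `target_threads`:
      - splitting: one subproblem handled by several sub-threads when
        the phase has too little parallelism,
      - merging: several subproblems handled by one thread when the
        phase has too much.
    """
    if num_subproblems >= target_threads:
        per_thread = -(-num_subproblems // target_threads)  # ceiling division
        return 1, per_thread
    split = min(max_split, max(1, target_threads // num_subproblems))
    return split, 1

# Example: an early phase with few, large subproblems vs. a late phase with many small ones.
print(plan_phase(num_subproblems=32,   target_threads=1024))  # -> (8, 1): split
print(plan_phase(num_subproblems=8192, target_threads=1024))  # -> (1, 8): merge
```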


Author(s): Ramon Amela, Cristian Ramon-Cortes, Jorge Ejarque, Javier Conejero, Rosa M. Badia

Python is a popular programming language due to the simplicity of its syntax, while still achieving good performance even though it is an interpreted language. Its adoption by multiple scientific communities has driven the emergence of a large number of libraries and modules, which has helped put Python at the top of the list of programming languages [1]. Task-based programming has been proposed in recent years as an alternative parallel programming model. PyCOMPSs follows this approach for Python, and this paper presents its extensions to combine task-based parallelism with thread-level parallelism. We also present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with just a few lines of Python.
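As an illustration of the task-based style that PyCOMPSs follows, here is a minimal blocked matrix-multiplication sketch in the spirit of the linear algebra benchmarks mentioned above; the task granularity and decorator options shown are illustrative assumptions, not taken from the paper:

```python
# Plain Python functions are annotated as tasks; the PyCOMPSs runtime tracks
# data dependencies between them and runs independent tasks in parallel.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=1)
def block_multiply(a, b):
    # One task per block product; independent block products can run concurrently.
    return a @ b

@task(returns=1)
def block_add(a, b):
    return a + b

def blocked_matmul(a_blocks, b_blocks):
    """C[i][j] = sum_k A[i][k] @ B[k][j], built as a graph of small tasks."""
    n = len(a_blocks)
    c = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = block_multiply(a_blocks[i][0], b_blocks[0][j])
            for k in range(1, n):
                acc = block_add(acc, block_multiply(a_blocks[i][k], b_blocks[k][j]))
            c[i][j] = acc
    # Synchronize: bring the (future) results back to the main program.
    return [[compss_wait_on(c[i][j]) for j in range(n)] for i in range(n)]

# Usage (under a PyCOMPSs runtime), e.g. with a 2x2 grid of NumPy blocks:
#   a_blocks = [[np.random.rand(512, 512) for _ in range(2)] for _ in range(2)]
#   c = blocked_matmul(a_blocks, a_blocks)
```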


Author(s): Cao Gao, Anthony Gutierrez, Ronald G. Dreslinski, Trevor Mudge, Krisztian Flautner, ...

VLSI Design, 2016, Vol 2016, pp. 1-12
Author(s): Yumin Hou, Hu He, Xu Yang, Deyuan Guo, Xu Wang, ...

This paper proposes FuMicro, a fused microarchitecture that integrates both an in-order superscalar mode and a Very Long Instruction Word (VLIW) mode in a single core. A processor with the FuMicro microarchitecture can operate alternately in in-order superscalar and VLIW modes, using the same pipeline and the same Instruction Set Architecture (ISA). A small modification to the compiler expands the register file in VLIW mode. Mode switching is decided by software and requires no extra hardware. VLIW code can be packaged as library functions so that users are exposed only to the superscalar mode, providing a convenient development environment. FuMicro could serve as a universal microarchitecture, since it can be applied to different ISAs; in this paper, we focus on an implementation of FuMicro with the ARM ISA. The architecture is evaluated on gem5, a cycle-accurate microarchitecture simulation platform. By adopting the FuMicro microarchitecture, performance improves by 10% on average, and by up to 47.3% in the best case, compared with the pure in-order superscalar mode. The results show that FuMicro can improve Instruction-Level Parallelism (ILP) significantly, making it promising for expanding digital signal processing capability on a General-Purpose Processor.

